This LoRA adapter is a fine-tuned version of deepseek-ai/DeepSeek-V3.1, trained with Reinforcement Learning (RL) on the Hendrycks MATH dataset using the Tinker framework.
The model was trained with the following hyperparameters:
The model expects the problem statement followed by a specific instruction suffix. A few-shot prompting strategy was used during training.
Suffix: " Write your answer in \boxed{} format."
Example:
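The prompt format described above can be sketched as follows. This is a minimal illustration, not the training code: the helper names `build_prompt` and `extract_boxed` are hypothetical, the few-shot prefix used during training is not reproduced here, and the answer-extraction step is an assumption about how a `\boxed{}` answer would typically be read back from a completion.

```python
from typing import Optional

# Instruction suffix quoted from this model card.
SUFFIX = " Write your answer in \\boxed{} format."


def build_prompt(problem: str) -> str:
    """Hypothetical helper: append the instruction suffix to a problem statement."""
    return problem.strip() + SUFFIX


def extract_boxed(text: str) -> Optional[str]:
    """Hypothetical helper: return the contents of the last \\boxed{...} in text.

    Tracks brace depth so nested LaTeX such as \\frac{1}{2} is kept whole.
    Returns None if no balanced \\boxed{...} is found.
    """
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # braces never closed


if __name__ == "__main__":
    prompt = build_prompt("What is the value of 1/4 + 1/4?")
    print(prompt)
    print(extract_boxed("Adding the fractions gives \\boxed{\\frac{1}{2}}."))
```

The resulting prompt string is what would be passed to the model; any few-shot examples would be prepended before it.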
Results on the MATH-500 dataset:
Base model
deepseek-ai/DeepSeek-V3.1-Base