DeepSeek-V3.1 Math RL (Tinker)

This LoRA adapter is a fine-tuned version of deepseek-ai/DeepSeek-V3.1, trained with Reinforcement Learning (RL) on the Hendrycks MATH dataset using the Tinker framework.

Model Details

  • Base Model: deepseek-ai/DeepSeek-V3.1
  • Training Method: Reinforcement Learning (likely Group Relative Policy Optimization (GRPO) or group-based PPO, inferred from the group-size configuration)
  • Dataset: Hendrycks MATH (training split, with the MATH-500 test problems filtered out)
  • Language: English
  • Task: Mathematical Problem Solving

Training Configuration

The model was trained with the following hyperparameters:

  • Environment: math (Hendrycks MATH)
  • Group Size: 16 (number of samples generated per prompt for advantage estimation; see the sketch after this list)
  • Groups Per Batch: 64
  • Learning Rate: 2e-4
  • Max Generation Tokens: 512
  • Framework: Tinker Cookbook
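
The card does not spell out the update rule, but with a group size of 16 the GRPO-style advantage for each sample is typically its reward normalized against the other samples drawn for the same prompt. Below is a minimal sketch of that step, assuming per-group reward normalization; the function and variable names are illustrative and are not the Tinker Cookbook API.

```python
import numpy as np

GROUP_SIZE = 16  # samples drawn per prompt (matches the config above)

def group_relative_advantages(rewards: np.ndarray) -> np.ndarray:
    """GRPO-style advantages: normalize each sample's reward against the
    mean/std of its own group (all samples for the same prompt).

    rewards: shape (num_prompts, GROUP_SIZE), e.g. 1.0 if the final
    \\boxed{} answer matched the reference answer, else 0.0.
    """
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    # Epsilon guards groups that are all-correct or all-wrong (std == 0).
    return (rewards - mean) / (std + 1e-8)

# Example: one prompt where 4 of the 16 samples answered correctly.
rewards = np.zeros((1, GROUP_SIZE))
rewards[0, :4] = 1.0
print(group_relative_advantages(rewards))  # correct ≈ +1.73, wrong ≈ -0.58
```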

Usage
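
The card does not include loading code. If the adapter weights follow the standard PEFT LoRA layout, a typical loading sketch looks like the following (setting aside the hardware needed for the very large base model); the PEFT-compatible export format is an assumption, not documented Tinker behavior.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_ID = "deepseek-ai/DeepSeek-V3.1"
ADAPTER_ID = "Nagi-ovo/DeepSeek-V3.1-Math-RL-G16-LoRA"

base = AutoModelForCausalLM.from_pretrained(
    BASE_ID,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,  # DeepSeek models may ship custom modeling code
)
tokenizer = AutoTokenizer.from_pretrained(BASE_ID, trust_remote_code=True)

# Attach the LoRA adapter on top of the frozen base weights.
model = PeftModel.from_pretrained(base, ADAPTER_ID)
```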

Prompt Format

The model expects the problem statement followed by a fixed instruction suffix; a few-shot prompting strategy was also used during training.

Suffix: " Write your answer in \boxed{} format."

Example:

  • Problem: “How many r’s are in strawberry?”
  • Prompt: problem + " Write your answer in \boxed{} format."
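
A minimal sketch of building the prompt and parsing the model's answer is shown below. The regex-based extraction is an assumption about how \boxed{} answers are typically parsed, not code from the training setup.

```python
import re

SUFFIX = r" Write your answer in \boxed{} format."

def build_prompt(problem: str) -> str:
    """Append the instruction suffix the model was trained with."""
    return problem + SUFFIX

def extract_boxed_answer(completion: str) -> str | None:
    """Return the contents of the last \\boxed{...} in the completion.

    Handles simple contents only; nested braces (e.g. \\boxed{\\frac{1}{2}})
    would need a small brace-matching parser instead.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return matches[-1] if matches else None

prompt = build_prompt("How many r's are in strawberry?")
print(prompt)
print(extract_boxed_answer(r"The count is \boxed{3}."))  # -> "3"
```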

Metrics During Training

[Screenshot: training metric curves]

Evaluation on the MATH-500 test set:

[Screenshot: MATH-500 evaluation results]
