DeepSeek-V3.1 Math RL (Tinker)

This LoRA adapter fine-tunes deepseek-ai/DeepSeek-V3.1 with Reinforcement Learning (RL) on the Hendrycks MATH dataset. It was trained with the Tinker framework.

Model Details

Base Model: deepseek-ai/DeepSeek-V3.1
Training Method: Reinforcement Learning (likely Group Relative Policy Optimization (GRPO) or a similar group-based PPO variant, inferred from the group size configuration)
Dataset: Hendrycks MATH (training split, with the MATH-500 test problems filtered out)
Language: English
Task: Mathematical Problem Solving
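
The card does not say how the MATH-500 problems were removed from the training split. As a minimal sketch, assuming problems are matched by their text after whitespace normalization, the filtering could look like this. The function name `filter_train` and the matching rule are hypothetical, not taken from the Tinker Cookbook.

```python
def filter_train(train_problems, test_problems):
    """Drop training problems whose text also appears in the held-out test set.

    Assumption: two problems are "the same" if their text matches after
    collapsing runs of whitespace. The real pipeline may use another rule.
    """
    def normalize(text):
        return " ".join(text.split())

    held_out = {normalize(p) for p in test_problems}
    return [p for p in train_problems if normalize(p) not in held_out]


# Toy data, not real MATH problems.
train = ["What is 1 + 1?", "Solve  x + 2 = 5.", "Find the area of a unit square."]
test = ["Solve x + 2 = 5."]
kept = filter_train(train, test)
```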

Training Configuration

The model was trained with the following hyperparameters:

Environment: math (Hendrycks MATH)
Group Size: 16 (Number of samples generated per prompt for advantage estimation)
Groups Per Batch: 64
Learning Rate: 2e-4
Max Generation Tokens: 512
Framework: Tinker Cookbook
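
To make the group settings concrete, here is a small sketch of group-relative advantage estimation. It assumes each sample's advantage is its reward minus the mean reward of its group, which is one common form of group-based RL; the card only says the method is "likely" GRPO-style, so treat the centering rule and the 0/1 rewards as assumptions rather than the actual training code.

```python
from statistics import mean

GROUP_SIZE = 16        # samples per prompt, from the configuration above
GROUPS_PER_BATCH = 64  # prompts per batch, from the configuration above


def group_advantages(rewards):
    """Center each reward on the mean reward of its group (assumed rule)."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


def samples_per_batch(group_size=GROUP_SIZE, groups_per_batch=GROUPS_PER_BATCH):
    """Total number of sampled completions in one training batch."""
    return group_size * groups_per_batch


# Hypothetical rewards for one group: 4 of the 16 samples were correct.
rewards = [1.0] * 4 + [0.0] * 12
advantages = group_advantages(rewards)
```

Under these assumptions, a group in which every sample gets the same reward yields all-zero advantages and so contributes no learning signal, and one batch contains 16 × 64 = 1024 sampled completions.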

Usage

Prompt Format

The model expects the problem statement followed by a fixed instruction suffix. Training also relied on a few-shot prompting strategy.

Suffix: " Write your answer in \boxed{} format."

Example:

Problem: “How many r’s are in strawberry?”
Prompt: problem + " Write your answer in \boxed{} format."
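
The prompt rule above can be sketched in a few lines of Python. The suffix string is taken from this card; the helper names `build_prompt` and `extract_boxed` are hypothetical, and the answer parser is only an illustrative assumption about how a boxed answer might be read back, not the grader used in training.

```python
SUFFIX = " Write your answer in \\boxed{} format."


def build_prompt(problem: str) -> str:
    """Append the instruction suffix to the problem statement."""
    return problem + SUFFIX


def extract_boxed(text: str):
    """Return the contents of the last boxed expression, or None.

    Tracks brace depth so nested LaTeX such as a fraction is kept whole.
    """
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    content = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(content)
        content.append(ch)
    return None  # braces never closed


prompt = build_prompt("How many r's are in strawberry?")
answer = extract_boxed("There are three, so the answer is \\boxed{3}.")
```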

Metrics during training

Evaluation on the MATH-500 test set (figure not preserved).

Model tree for Nagi-ovo/DeepSeek-V3.1-Math-RL-G16-LoRA

Base model: deepseek-ai/DeepSeek-V3.1-Base
Quantized: deepseek-ai/DeepSeek-V3.1
Finetuned (22): this model
