arxiv:2602.05261

Length-Unbiased Sequence Policy Optimization: Revealing and Controlling Response Length Variation in RLVR

Published on Feb 5

Submitted by Fanfan Liu on Feb 6

#3 Paper of the day


Authors:

Fanfan Liu ,

Abstract

AI-generated summary: This paper analyzes how RLVR algorithms affect response length in LLMs and VLMs, and proposes LUSPO to eliminate length bias and improve reasoning performance.

Recent applications of Reinforcement Learning with Verifiable Rewards (RLVR) to Large Language Models (LLMs) and Vision-Language Models (VLMs) have demonstrated significant success in enhancing reasoning capabilities for complex tasks. During RLVR training, an increase in response length is often regarded as a key factor contributing to the growth of reasoning ability. However, the patterns of change in response length vary significantly across different RLVR algorithms during the training process. To provide a fundamental explanation for these variations, this paper conducts an in-depth analysis of the components of mainstream RLVR algorithms. We present a theoretical analysis of the factors influencing response length and validate our theory through extensive experimentation. Building upon these theoretical findings, we propose the Length-Unbiased Sequence Policy Optimization (LUSPO) algorithm. Specifically, we rectify the length bias inherent in Group Sequence Policy Optimization (GSPO), rendering its loss function unbiased with respect to response length and thereby resolving the issue of response length collapse. We conduct extensive experiments across mathematical reasoning benchmarks and multimodal reasoning scenarios, where LUSPO consistently achieves superior performance. Empirical results demonstrate that LUSPO represents a novel, state-of-the-art optimization strategy compared to existing methods such as GRPO and GSPO.
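As a rough illustration of the length dependence the abstract refers to, the sketch below computes a GSPO-style sequence-level importance ratio (the geometric mean of the token-level ratios) and a clipped surrogate for one response. The `length_scaled` flag, which multiplies a response's surrogate by its token count, is an assumption made here for illustration only: this page does not state LUSPO's exact loss, so the paper and the linked repository remain the reference for the authors' formulation. All function names and the clipping value are hypothetical.

```python
import math


def gspo_sequence_ratio(logp_new, logp_old):
    """Length-normalized sequence ratio: exp(mean of token log-ratios)."""
    if not logp_new or len(logp_new) != len(logp_old):
        raise ValueError("need two non-empty, equal-length log-prob lists")
    log_ratios = [n - o for n, o in zip(logp_new, logp_old)]
    return math.exp(sum(log_ratios) / len(log_ratios))


def group_advantages(rewards, tiny=1e-8):
    """Group-normalized advantages: (reward - group mean) / group std."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + tiny) for r in rewards]


def sequence_surrogate(ratio, advantage, length, eps, length_scaled=False):
    """Clipped surrogate for one response.

    length_scaled=True multiplies by the token count. This is an
    illustrative assumption, not a formula confirmed by this page.
    """
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    surrogate = min(ratio * advantage, clipped * advantage)
    return surrogate * length if length_scaled else surrogate


# Two responses with the same per-token log-ratio but different lengths
# get the same sequence ratio, because the ratio averages over tokens.
short = gspo_sequence_ratio([-1.0] * 4, [-1.1] * 4)
long_ = gspo_sequence_ratio([-1.0] * 40, [-1.1] * 40)

# With length normalization each token's log-ratio carries weight 1/length,
# so a token in the long response moves the objective ten times less.
weight_short, weight_long = 1 / 4, 1 / 40
```

Because the ratio averages over tokens, each token in a longer response carries proportionally less weight in the sequence-level term, which is one way a sequence-level loss can end up depending on response length. Scaling by length, as the hypothetical flag does, offsets that 1/length factor.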


Community

liufanfanlff

Paper author Paper submitter about 18 hours ago

We introduce Length-Unbiased Sequence Policy Optimization (LUSPO), a novel reinforcement learning algorithm for training large language models. LUSPO consistently outperforms GRPO and GSPO on both dense small-scale models and large-scale MoE models. github: https://github.com/murphy4122/LUSPO

sorryhyun

about 12 hours ago

This reminds me of my master's thesis: https://arxiv.org/abs/2504.06037

avahal

39 minutes ago

arXivLens breakdown of this paper 👉 https://arxivlens.com/PaperView/Details/length-unbiased-sequence-policy-optimization-revealing-and-controlling-response-length-variation-in-rlvr-6117-71c4edfe

Executive Summary
Detailed Breakdown
Practical Applications

