Important
🌟 If you find this repository useful, please consider giving it a star!
🔥 News
- [2026/01] We have released the full codebase, including linearity analysis, RL training on verl, acceleration methods, and evaluation scripts. Preprocessed RL datasets and checkpoints are now available.
This repository contains the official implementation of the paper "Not All Steps are Informative: On the Linearity of LLMs’ RLVR Training".
We reveal a critical phenomenon: during RLVR (Reinforcement Learning with Verifiable Rewards) training, LLMs evolve in a remarkably linear manner. Leveraging this observation, we show that future model states can be accurately predicted from intermediate checkpoints via extrapolation, effectively bypassing expensive training steps.
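The core idea can be sketched in a few lines: fit a least-squares line to each parameter's trajectory over the observed training steps, then evaluate that line at a future step. This is a minimal illustration with made-up step numbers and tiny placeholder weight vectors, not the repository's actual extrapolation code.

```python
import numpy as np

# Illustrative only: step numbers and weights are fabricated placeholders.
steps = np.array([100, 200, 300, 400])        # observed training steps
weights = np.array([                          # one row per checkpoint,
    [0.10, -0.20],                            # one column per parameter
    [0.12, -0.24],
    [0.14, -0.28],
    [0.16, -0.32],
])

# Least-squares fit w(s) ~ a * s + b, independently per parameter.
A = np.vstack([steps, np.ones_like(steps)]).T   # design matrix, shape (4, 2)
coef, *_ = np.linalg.lstsq(A, weights, rcond=None)

target_step = 800                               # a step we never trained to
predicted = coef[0] * target_step + coef[1]     # extrapolated weights
print(predicted)
```

Because the toy trajectory above is exactly linear, the prediction at step 800 simply continues the trend; on real checkpoints the fit quality is what the paper's R² analysis quantifies.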
Figure 1: Linearity of model weights and outputs during RLVR training.
Figure 2: Consistency across diverse setups; linearity remains robust across various settings.
Building on the linearity of RLVR training, we propose Logits Extrapolation, Weight Extrapolation, and RL-Extra. These methods enable the prediction of model behavior at future steps using early trajectories, significantly accelerating the training process.
Key Findings:
- Logits Extrapolation: Delivers consistent accuracy improvements over standard RL on benchmarks like AIME and LiveCodeBench (LCB).
- Weight Extrapolation: Achieves high fidelity in predicting future weights, particularly on AIME24.
- Efficiency: Our methods significantly reduce the number of actual training steps required to reach target accuracy.
- RL-Extra vs. GRPO: With a fixed budget of actual training steps $s$, RL-Extra consistently outperforms the GRPO baseline across AIME24, AIME25, MATH500, and LiveCodeBench.
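Logits extrapolation can be sketched the same way as weight extrapolation: take per-token logits from an earlier and a later checkpoint and extend the linear trend past the later one. The checkpoint labels, logit values, and extrapolation factor below are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def extrapolate_logits(logits_early, logits_late, factor=1.0):
    """Continue the linear trend `factor` intervals beyond `logits_late`."""
    return logits_late + factor * (logits_late - logits_early)

# Fabricated logits over a 3-token vocabulary slice at two checkpoints.
logits_step100 = np.array([1.0, 0.5, -0.5])
logits_step200 = np.array([1.4, 0.3, -0.9])

pred = extrapolate_logits(logits_step100, logits_step200, factor=1.0)
probs = np.exp(pred) / np.exp(pred).sum()   # softmax to recover probabilities
print(pred, probs)
```

With `factor=1.0` this predicts the logits one full observed interval ahead (here, step 300); larger factors reach further but rely more heavily on the linearity holding.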
Install the verl environment (ensure you are in the project root):

```shell
cd verl
pip3 install -e .[vllm]
```

We use DeepSeek-R1-Distill-Qwen-1.5B as the base model.
- Generate Responses: We generated 64 responses for each AIME24 query.
  - Path: `evaluation/outputs/aime24_distill-qwen-1-5b.json`
- Preprocessing: Use the provided scripts to format data for RL training and evaluation within the verl framework.
  - Script location: `verl/examples/data_preprocess`

Note: The processed dataset is readily available as a HuggingFace Dataset.
To reproduce the DeepScaleR baseline (using GRPO), run the following command:

```shell
bash verl/examples/grpo_trainer/run_distill-qwen-1-5b_deepscaler.sh
```

Model Outputs (Token Log-probs): Use previously generated responses as probes to compute conditional log-probabilities for each token across checkpoints.
```shell
# 1. Compute log-probabilities
bash scripts/run_token_logprob_linearity.sh
# 2. Plot R^2 distribution
python3 analysis/token_logprob/plot_token_logprob_linearity.py
```

Model Weights: Perform linear regression on model weights across training steps.

```shell
bash scripts/run_weight_linearity.sh
```

| Method | Description | Command |
|---|---|---|
| Logits Extrapolation | Extrapolates logits to improve performance. | `bash scripts/run_logits_extrapolation.sh` |
| Weight Extrapolation | Extrapolates weights to accelerate RLVR. | `bash scripts/run_weight_extrapolation.sh` |
| RL-Extra | Corrects the gradient trajectory for efficiency. | `bash scripts/run_rl_extrapolation.sh` |
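The R² statistic behind the linearity analysis can be sketched as an ordinary linear regression of a quantity (e.g. one token's conditional log-probability) against the training step. The step values and log-probs below are fabricated for illustration; the repository's scripts compute this over real checkpoints.

```python
import numpy as np

def r_squared(x, y):
    """Coefficient of determination for a degree-1 least-squares fit."""
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Fabricated example: a token's log-prob rising linearly with training step.
steps = np.array([50.0, 100.0, 150.0, 200.0])
logprobs = np.array([-2.0, -1.6, -1.2, -0.8])
print(r_squared(steps, logprobs))   # close to 1.0 for a linear trajectory
```

An R² near 1 across tokens and parameters is what licenses the extrapolation methods in the table above.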
Inference (vLLM):
```shell
python evaluation/inference_vllm_offline.py \
    --model_path deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
    --data aime24
```

Metrics Calculation (pass@k, avg@k):

```shell
python evaluation/pass_at_k_eval.py
```

For questions or feedback, please contact Tianle Wang.
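For reference, the standard unbiased pass@k estimator (Chen et al., 2021) that such evaluation scripts typically implement can be sketched as follows; the sample counts are illustrative, not results from this repository.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: n samples, c of them correct, 1 - C(n-c, k)/C(n, k)."""
    if n - c < k:
        return 1.0          # too few failures to fill a k-subset: always passes
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers: 64 samples per query (as in the AIME24 setup),
# of which 16 happen to be correct.
print(pass_at_k(64, 16, 1))   # 0.25
```

avg@k is simply the mean accuracy over the k samples, i.e. c/n when k = n.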
If you find this work helpful, please cite our paper:
```bibtex
@misc{wang2026stepsinformativelinearityllms,
      title={Not All Steps are Informative: On the Linearity of LLMs' RLVR Training},
      author={Tianle Wang and Zhongyuan Wu and Shenghao Jin and Hao Xu and Wei Chen and Ning Miao},
      year={2026},
      eprint={2601.04537},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2601.04537},
}
```


