Not All Steps are Informative:
On the Linearity of LLMs’ RLVR Training

Important

🌟 If you find this repository useful, please consider giving it a star!

🔥 News

[2026/01] We have released the full codebase, including linearity analysis, RL training on verl, acceleration methods, and evaluation scripts. Preprocessed RL datasets and checkpoints are now available.

This repository contains the official implementation of the paper "Not All Steps are Informative: On the Linearity of LLMs’ RLVR Training".

We reveal a critical phenomenon: during RLVR (Reinforcement Learning with Verification Rewards), LLMs evolve in a remarkably linear manner. Leveraging this observation, we demonstrate that future model states can be accurately predicted from intermediate checkpoints via extrapolation, effectively bypassing expensive training steps.

📊 Linearity Analysis

Figure 1: Linearity of model weights and outputs during RLVR training. (a) and (b) show the distribution of $R^2$ scores for weights and token log-probabilities, respectively. Both distributions are strongly concentrated around 0.9, indicating high linearity. (c) plots the trajectories of four randomly selected weights, while (d) tracks token log-probability shifts at four example positions.

Figure 2: Consistency across diverse setups. Linearity remains robust across various settings. $R^2$ scores consistently exceed 0.7 (dashed line) regardless of the base model (e.g., DS-Qwen, DS-Llama), model scale (1.5B to 8B), or training algorithm (GSPO, Reinforce++, and GRPO).

🚀 Accelerating RLVR via Extrapolation

Building on the linearity of RLVR training, we propose Logits Extrapolation, Weight Extrapolation, and RL-Extra. These methods enable the prediction of model behavior at future steps using early trajectories, significantly accelerating the training process.

Key Findings:

Logits Extrapolation: Delivers consistent accuracy improvements over standard RL on benchmarks like AIME and LiveCodeBench (LCB).
Weight Extrapolation: Achieves high fidelity in predicting future weights, particularly on AIME24.
Efficiency: Our methods significantly reduce the number of actual training steps required to reach target accuracy.
RL-Extra vs. GRPO: With a fixed training budget (actual steps $s$), RL-Extra consistently outperforms the GRPO baseline across AIME24, AIME25, MATH500, and LiveCodeBench.

🛠️ Usage

1. Installation

Install the verl environment (ensure you are in the project root):

cd verl
pip3 install -e .[vllm]

2. Data Preparation

We utilize DeepSeek-R1-Distill-Qwen-1.5B as the base model.

Generate Responses: We generated 64 responses for each AIME24 query.
- Path: evaluation/outputs/aime24_distill-qwen-1-5b.json
Preprocessing: Use the provided scripts to format data for RL training and evaluation within the verl framework.
- Script location: verl/examples/data_preprocess

Note: The processed dataset is readily available on HuggingFace Dataset.

3. RL Training (Baseline)

To reproduce the DeepScaleR baseline (using GRPO), run the following command:

bash verl/examples/grpo_trainer/run_distill-qwen-1-5b_deepscaler.sh

4. Linearity Analysis

Model Outputs (Token Log-probs): Use previously generated responses as probes to compute conditional log-probabilities for each token across checkpoints.

# 1. Compute log-probabilities
bash scripts/run_token_logprob_linearity.sh

# 2. Plot R^2 distribution
python3 analysis/token_logprob/plot_token_logprob_linearity.py

Model Weights: Perform linear regression on model weights across training steps.

bash scripts/run_weight_linearity.sh

5. Extrapolation Methods & RL-Extra

Method	Description	Command
Logits Extrapolation	Extrapolates logits to improve performance.	`bash scripts/run_logits_extrapolation.sh`
Weight Extrapolation	Extrapolates weights to accelerate RLVR.	`bash scripts/run_weight_extrapolation.sh`
RL-Extra	Corrects gradient trajectory for efficiency.	`bash scripts/run_rl_extrapolation.sh`

6. Evaluation

Inference (vLLM):

python evaluation/inference_vllm_offline.py \
  --model_path deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
  --data aime24

Metrics Calculation (pass@k, avg@k):

python evaluation/pass_at_k_eval.py

✉️ Contact

For questions or feedback, please contact Tianle Wang.

🖊️ Citation

If you find this work helpful, please cite our paper:

@misc{wang2026stepsinformativelinearityllms,
      title={Not All Steps are Informative: On the Linearity of LLMs' RLVR Training}, 
      author={Tianle Wang and Zhongyuan Wu and Shenghao Jin and Hao Xu and Wei Chen and Ning Miao},
      year={2026},
      eprint={2601.04537},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2601.04537}, 
}

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
acceleration		acceleration
analysis		analysis
assets		assets
evaluation		evaluation
scripts		scripts
verl		verl
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Not All Steps are Informative:
On the Linearity of LLMs’ RLVR Training

📊 Linearity Analysis

🚀 Accelerating RLVR via Extrapolation

🛠️ Usage

1. Installation

2. Data Preparation

3. RL Training (Baseline)

4. Linearity Analysis

5. Extrapolation Methods & RL-Extra

6. Evaluation

✉️ Contact

🖊️ Citation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Not All Steps are Informative: On the Linearity of LLMs’ RLVR Training

📊 Linearity Analysis

🚀 Accelerating RLVR via Extrapolation

🛠️ Usage

1. Installation

2. Data Preparation

3. RL Training (Baseline)

4. Linearity Analysis

5. Extrapolation Methods & RL-Extra

6. Evaluation

✉️ Contact

🖊️ Citation

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Not All Steps are Informative:
On the Linearity of LLMs’ RLVR Training

Packages