WARP: On the Benefits of Weight Averaged Rewarded Policies

Ramé, Alexandre; Ferret, Johan; Vieillard, Nino; Dadashi, Robert; Hussenot, Léonard; Cedoz, Pierre-Louis; Sessa, Pier Giuseppe; Girgin, Sertan; Douillard, Arthur; Bachem, Olivier

arXiv:2406.16768 (cs)

[Submitted on 24 Jun 2024]

Abstract: Reinforcement learning from human feedback (RLHF) aligns large language models (LLMs) by encouraging their generations to have high rewards, using a reward model trained on human preferences. To prevent the forgetting of pre-trained knowledge, RLHF usually incorporates a KL regularization; this forces the policy to remain close to its supervised fine-tuned initialization, though it hinders the reward optimization. To tackle the trade-off between KL and reward, in this paper we introduce a novel alignment strategy named Weight Averaged Rewarded Policies (WARP). WARP merges policies in the weight space at three distinct stages. First, it uses the exponential moving average of the policy as a dynamic anchor in the KL regularization. Second, it applies spherical interpolation to merge independently fine-tuned policies into a new enhanced one. Third, it linearly interpolates between this merged model and the initialization, to recover features from pre-training. This procedure is then applied iteratively, with each iteration's final model used as an advanced initialization for the next, progressively refining the KL-reward Pareto front, achieving superior rewards at fixed KL. Experiments with GEMMA policies validate that WARP improves their quality and alignment, outperforming other open-source LLMs.
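
The three weight-space merging stages named in the abstract can be illustrated with a minimal sketch on flat weight vectors. This is not the authors' implementation: the function names, the coefficients `mu`, `lam`, and `eta`, and the choice to apply the spherical interpolation to each policy's offset from the shared initialization are assumptions made for illustration, and the RL fine-tuning itself is omitted.

```python
import math


def ema_update(anchor, policy, mu=0.01):
    """Stage 1 (sketch): move the KL anchor a small step toward the policy.

    `mu` is a hypothetical update rate; the abstract does not give a value.
    """
    return [(1.0 - mu) * a + mu * p for a, p in zip(anchor, policy)]


def slerp(init, theta_a, theta_b, lam=0.5):
    """Stage 2 (sketch): spherical interpolation of two fine-tuned policies.

    Assumption: the interpolation acts on the offsets from the shared
    initialization `init`, and the result is added back to `init`.
    """
    delta_a = [a - i for a, i in zip(theta_a, init)]
    delta_b = [b - i for b, i in zip(theta_b, init)]
    norm_a = math.sqrt(sum(x * x for x in delta_a))
    norm_b = math.sqrt(sum(x * x for x in delta_b))

    # Default to plain linear weights; used when the angle is undefined
    # (a zero offset) or when sin(omega) is too small to divide by.
    w_a, w_b = 1.0 - lam, lam
    if norm_a > 0.0 and norm_b > 0.0:
        dot = sum(x * y for x, y in zip(delta_a, delta_b))
        cos_omega = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
        omega = math.acos(cos_omega)
        sin_omega = math.sin(omega)
        if abs(sin_omega) > 1e-8:
            w_a = math.sin((1.0 - lam) * omega) / sin_omega
            w_b = math.sin(lam * omega) / sin_omega

    return [i + w_a * da + w_b * db
            for i, da, db in zip(init, delta_a, delta_b)]


def interpolate_towards_init(init, merged, eta=0.5):
    """Stage 3 (sketch): linear interpolation between init and merged weights.

    eta = 0 returns the initialization, eta = 1 returns the merged model.
    """
    return [(1.0 - eta) * i + eta * m for i, m in zip(init, merged)]


# Toy example: two "policies" that moved in orthogonal directions from init.
init = [0.0, 0.0]
theta_a = [1.0, 0.0]
theta_b = [0.0, 1.0]

merged = slerp(init, theta_a, theta_b, lam=0.5)
final = interpolate_towards_init(init, merged, eta=0.5)
anchor = ema_update([0.0, 0.0], [1.0, 1.0], mu=0.1)
```

In this toy case the spherically merged offset keeps unit norm, whereas a plain average of the two offsets would shrink it. Under the iterative procedure described in the abstract, `final` would then serve as the initialization for the next round.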

Comments:	11 main pages (34 pages with Appendix)
Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2406.16768 [cs.LG]
	(or arXiv:2406.16768v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2406.16768

Submission history

From: Alexandre Rame
[v1] Mon, 24 Jun 2024 16:24:34 UTC (815 KB)
