Confronting Reward Model Overoptimization with Constrained RLHF

Moskovitz, Ted; Singh, Aaditya K.; Strouse, DJ; Sandholm, Tuomas; Salakhutdinov, Ruslan; Dragan, Anca D.; McAleer, Stephen

Computer Science > Machine Learning

arXiv:2310.04373 (cs)

[Submitted on 6 Oct 2023 (v1), last revised 10 Oct 2023 (this version, v2)]

Abstract: Large language models are typically aligned with human preferences by optimizing $\textit{reward models}$ (RMs) fitted to human feedback. However, human preferences are multi-faceted, and it is increasingly common to derive reward from a composition of simpler reward models which each capture a different aspect of language quality. This itself presents a challenge, as it is difficult to appropriately weight these component RMs when combining them. Compounding this difficulty, because any RM is only a proxy for human evaluation, this process is vulnerable to $\textit{overoptimization}$, wherein past a certain point, accumulating higher reward is associated with worse human ratings. In this paper, we perform, to our knowledge, the first study on overoptimization in composite RMs, showing that correlation between component RMs has a significant effect on the locations of these points. We then introduce an approach to solve this issue using constrained reinforcement learning as a means of preventing the agent from exceeding each RM's threshold of usefulness. Our method addresses the problem of weighting component RMs by learning dynamic weights, naturally expressed by Lagrange multipliers. As a result, each RM stays within the range at which it is an effective proxy, improving evaluation performance. Finally, we introduce an adaptive method using gradient-free optimization to identify and optimize towards these points during a single run.
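
Illustrative note: the idea of dynamic weights expressed by Lagrange multipliers can be sketched with a generic Lagrangian relaxation. The following is a minimal sketch under stated assumptions, not the paper's algorithm: it assumes the goal is to maximize the sum of component rewards r_i subject to each mean reward staying at or below a threshold theta_i, and it uses plain projected dual ascent. The function names, the thresholds, and the learning rate are all hypothetical.

```python
from typing import List


def update_multipliers(
    lambdas: List[float],
    mean_rewards: List[float],
    thresholds: List[float],
    lr: float = 0.1,
) -> List[float]:
    """Projected dual ascent on the multipliers (assumed update rule).

    lambda_i grows while component i's mean reward exceeds its threshold
    and shrinks otherwise; it is clipped at zero so it stays non-negative.
    """
    return [
        max(0.0, lam + lr * (r - th))
        for lam, r, th in zip(lambdas, mean_rewards, thresholds)
    ]


def combined_reward(component_rewards: List[float], lambdas: List[float]) -> float:
    """Per-sample Lagrangian reward for the assumed problem.

    For "maximize sum_i r_i subject to E[r_i] <= theta_i", the Lagrangian is
    sum_i r_i - sum_i lambda_i * (r_i - theta_i). Dropping the terms that do
    not depend on the sample leaves sum_i (1 - lambda_i) * r_i, so each
    multiplier acts as a dynamic weight on its reward model.
    """
    return sum((1.0 - lam) * r for lam, r in zip(lambdas, component_rewards))


# Toy usage with two hypothetical component RMs, both with threshold 1.0.
lambdas = [0.0, 0.0]
lambdas = update_multipliers(lambdas, mean_rewards=[1.5, 0.2], thresholds=[1.0, 1.0], lr=0.5)
reward = combined_reward([2.0, 1.0], lambdas)
```

In this toy run the first component exceeds its threshold, so its multiplier rises and its effective weight drops, while the second component keeps full weight. In an actual RLHF loop the combined reward would be passed to the policy optimizer between multiplier updates.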

Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2310.04373 [cs.LG]
	(or arXiv:2310.04373v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2310.04373

Submission history

From: Theodore Moskovitz
[v1] Fri, 6 Oct 2023 16:59:17 UTC (1,152 KB)
[v2] Tue, 10 Oct 2023 15:01:11 UTC (1,153 KB)
