When Your AIs Deceive You: Challenges of Partial Observability in Reinforcement Learning from Human Feedback

Lang, Leon; Foote, Davis; Russell, Stuart; Dragan, Anca; Jenner, Erik; Emmons, Scott

Computer Science > Machine Learning

arXiv:2402.17747 (cs)

[Submitted on 27 Feb 2024 (v1), last revised 5 Nov 2024 (this version, v4)]

Abstract: Past analyses of reinforcement learning from human feedback (RLHF) assume that the human evaluators fully observe the environment. What happens when human feedback is based only on partial observations? We formally define two failure cases: deceptive inflation and overjustification. Modeling the human as Boltzmann-rational w.r.t. a belief over trajectories, we prove conditions under which RLHF is guaranteed to result in policies that deceptively inflate their performance, overjustify their behavior to make an impression, or both. Under the new assumption that the human's partial observability is known and accounted for, we then analyze how much information the feedback process provides about the return function. We show that sometimes, the human's feedback determines the return function uniquely up to an additive constant, but in other realistic cases, there is irreducible ambiguity. We propose exploratory research directions to help tackle these challenges, experimentally validate both the theoretical concerns and potential mitigations, and caution against blindly applying RLHF in partially observable settings.

Comments:	Advances in Neural Information Processing Systems 37 (NeurIPS 2024)
Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Cite as:	arXiv:2402.17747 [cs.LG]
	(or arXiv:2402.17747v4 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2402.17747

Submission history

From: Leon Lang
[v1] Tue, 27 Feb 2024 18:32:11 UTC (855 KB)
[v2] Sun, 3 Mar 2024 02:06:42 UTC (855 KB)
[v3] Sat, 8 Jun 2024 12:49:22 UTC (5,510 KB)
[v4] Tue, 5 Nov 2024 16:46:01 UTC (5,599 KB)
