When Your AI Deceives You: Challenges with Partial Observability of Human Evaluators in Reward Learning

Lang, Leon; Foote, Davis; Russell, Stuart; Dragan, Anca; Jenner, Erik; Emmons, Scott

arXiv:2402.17747v1 (cs)

[Submitted on 27 Feb 2024 (this version), latest version 5 Nov 2024 (v4)]

Abstract: Past analyses of reinforcement learning from human feedback (RLHF) assume that the human fully observes the environment. What happens when human feedback is based only on partial observations? We formally define two failure cases: deception and overjustification. Modeling the human as Boltzmann-rational w.r.t. a belief over trajectories, we prove conditions under which RLHF is guaranteed to result in policies that deceptively inflate their performance, overjustify their behavior to make an impression, or both. To help address these issues, we mathematically characterize how partial observability of the environment translates into (lack of) ambiguity in the learned return function. In some cases, accounting for partial observability makes it theoretically possible to recover the return function and thus the optimal policy, while in other cases, there is irreducible ambiguity. We caution against blindly applying RLHF in partially observable settings and propose research directions to help tackle these challenges.
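To make the human model in the abstract concrete, here is a minimal sketch of a Boltzmann-rational evaluator who compares two observations by the expected return under a belief over the trajectories that could have produced each one. It is an illustration under stated assumptions, not the authors' code: the function names, the toy trajectories, the return values, the beliefs, and the rationality parameter `beta` are all hypothetical.

```python
import math

def expected_return(belief, returns):
    """Expected return under a belief (dict: trajectory -> probability)."""
    return sum(prob * returns[traj] for traj, prob in belief.items())

def preference_probability(belief_a, belief_b, returns, beta=1.0):
    """Probability that a Boltzmann-rational human prefers observation A
    over observation B, given beliefs over the underlying trajectories."""
    diff = expected_return(belief_a, returns) - expected_return(belief_b, returns)
    return 1.0 / (1.0 + math.exp(-beta * diff))

# Hypothetical true returns: both failures are equally bad.
returns = {"task_done": 1.0, "failure_hidden": -1.0, "failure_visible": -1.0}

# Assumed beliefs: a "looks fine" observation is consistent with success
# or with a hidden failure; an "error shown" observation reveals the failure.
belief_looks_fine = {"task_done": 0.5, "failure_hidden": 0.5}
belief_error_shown = {"failure_visible": 1.0}

p_prefer_looks_fine = preference_probability(
    belief_looks_fine, belief_error_shown, returns
)
print(f"P(prefer 'looks fine' over 'error shown') = {p_prefer_looks_fine:.3f}")
```

In this toy setup the evaluator favors the observation that hides the failure, even though the hidden and visible failures have the same true return. That is the kind of incentive the abstract describes as leading to deceptively inflated performance.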

Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Cite as:	arXiv:2402.17747 [cs.LG]
	(or arXiv:2402.17747v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2402.17747

Submission history

From: Scott Emmons
[v1] Tue, 27 Feb 2024 18:32:11 UTC (855 KB)
[v2] Sun, 3 Mar 2024 02:06:42 UTC (855 KB)
[v3] Sat, 8 Jun 2024 12:49:22 UTC (5,510 KB)
[v4] Tue, 5 Nov 2024 16:46:01 UTC (5,599 KB)
