Research article
Open access
DOI: 10.1145/3564246.3585099

Planning and Learning in Partially Observable Systems via Filter Stability

Published: 02 June 2023

Abstract

Partially Observable Markov Decision Processes (POMDPs) are an important model in reinforcement learning that takes into account the agent’s uncertainty about its current state. In the literature on POMDPs, it is customary to assume access to a planning oracle that computes an optimal policy when the parameters are known, even though this problem is known to be computationally hard. The major obstruction is the Curse of History, which arises because optimal policies for POMDPs may depend on the entire observation history thus far. In this work, we revisit the planning problem and ask: Are there natural and well-motivated assumptions that avoid the Curse of History in POMDP planning (and beyond)?
We assume one-step observability, which stipulates that well-separated distributions on states lead to well-separated distributions on observations. Our main technical result is a new quantitative bound for filter stability in observable Hidden Markov Models (HMMs) and POMDPs, i.e., the rate at which the Bayes filter for the latent state forgets its initialization. We give the following algorithmic applications:
First, a quasipolynomial-time algorithm for planning in one-step observable POMDPs and a matching computational lower bound under the Exponential Time Hypothesis. Crucially, we require no assumptions on the transition dynamics of the POMDP.
Second, a quasipolynomial-time algorithm for improper learning of overcomplete HMMs, which does not require full-rank transitions; full-rankness is violated, for instance, when the number of latent states varies over time. Instead we assume multi-step observability, a generalization of observability which allows observations to be informative in aggregate.
Third, a quasipolynomial-time algorithm for computing approximate coarse correlated equilibria in one-step observable Partially Observable Markov Games (POMGs).
Thus we show that observability gives a blueprint for circumventing computational intractability in a variety of settings with partial observations, including planning, learning and computing equilibria.
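
To make the abstract’s central notion concrete: the Bayes filter is the posterior distribution over the latent state given the observations seen so far, and filter stability is the rate at which this posterior forgets a possibly incorrect initial belief. The following is a minimal illustrative sketch, not the paper’s algorithm: it simulates a small hypothetical 3-state HMM in numpy (the matrices T and O and the helper filter_update are invented for illustration), whose observation matrix is informative in the spirit of one-step observability, and it prints how the total-variation distance between two filters started from different priors shrinks as observations arrive.

# Illustrative sketch (assumed example, not from the paper): run the Bayes filter
# for a small hypothetical HMM from two different priors and watch the
# total-variation distance between the resulting beliefs contract over time.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-state HMM with an informative (roughly one-step observable) sensor.
T = np.array([[0.80, 0.15, 0.05],   # T[i, j] = P(next state j | current state i)
              [0.10, 0.80, 0.10],
              [0.05, 0.15, 0.80]])
O = np.array([[0.90, 0.05, 0.05],   # O[i, k] = P(observation k | state i)
              [0.05, 0.90, 0.05],
              [0.05, 0.05, 0.90]])

def filter_update(belief, obs):
    """One Bayes-filter step: propagate the belief through T, then condition on obs."""
    predicted = belief @ T               # predicted distribution over the next state
    posterior = predicted * O[:, obs]    # reweight by the observation likelihoods
    return posterior / posterior.sum()   # renormalize to a probability vector

state = rng.integers(3)                  # true (hidden) state of the chain
b_one = np.array([1.0, 0.0, 0.0])        # filter started from one prior
b_two = np.array([0.0, 0.0, 1.0])        # filter started from a very different prior

for t in range(20):
    state = rng.choice(3, p=T[state])    # latent state transitions
    obs = rng.choice(3, p=O[state])      # emitted observation
    b_one = filter_update(b_one, obs)
    b_two = filter_update(b_two, obs)
    tv = 0.5 * np.abs(b_one - b_two).sum()
    print(f"t={t:2d}  TV(b_one, b_two) = {tv:.4f}")

The paper’s contribution is a quantitative version of this phenomenon: under one-step observability the contraction happens at a provable rate, which underlies the quasipolynomial-time algorithms listed above. In this sketch the observed contraction rate simply reflects how informative the hypothetical O is.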


Cited By

  • (2024) Exploration is Harder than Prediction: Cryptographically Separating Reinforcement Learning from Supervised Learning. 2024 IEEE 65th Annual Symposium on Foundations of Computer Science (FOCS), 1953–1967. DOI: 10.1109/FOCS61266.2024.00117. Online publication date: 27 Oct 2024.

Published In

STOC 2023: Proceedings of the 55th Annual ACM Symposium on Theory of Computing
June 2023
1926 pages
ISBN: 9781450399135
DOI: 10.1145/3564246
This work is licensed under a Creative Commons Attribution 4.0 International License.

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Partially observable Markov decision processes
  2. filter stability

Conference

STOC '23

Acceptance Rates

Overall Acceptance Rate: 1,469 of 4,586 submissions, 32%

Article Metrics

  • Downloads (Last 12 months): 415
  • Downloads (Last 6 weeks): 44
Reflects downloads up to 15 Jan 2025

