Research article
Open access
DOI: 10.1145/3564246.3585099

Planning and Learning in Partially Observable Systems via Filter Stability

Published: 02 June 2023

Abstract

Partially Observable Markov Decision Processes (POMDPs) are an important model in reinforcement learning that takes into account the agent’s uncertainty about its current state. In the literature on POMDPs, it is customary to assume access to a planning oracle that computes an optimal policy when the parameters are known, even though this problem is known to be computationally hard. The major obstruction is the Curse of History, which arises because optimal policies for POMDPs may depend on the entire observation history thus far. In this work, we revisit the planning problem and ask: Are there natural and well-motivated assumptions that avoid the Curse of History in POMDP planning (and beyond)?
We assume one-step observability, which stipulates that well-separated distributions on states lead to well-separated distributions on observations. Our main technical result is a new quantitative bound for filter stability in observable Hidden Markov Models (HMMs) and POMDPs, i.e., the rate at which the Bayes filter for the latent state forgets its initialization. We give the following algorithmic applications:
First, a quasipolynomial-time algorithm for planning in one-step observable POMDPs and a matching computational lower bound under the Exponential Time Hypothesis. Crucially, we require no assumptions on the transition dynamics of the POMDP.
Second, a quasipolynomial-time algorithm for improper learning of overcomplete HMMs, which does not require full-rank transitions; full-rankness is violated, for instance, when the number of latent states varies over time. Instead we assume multi-step observability, a generalization of observability which allows observations to be informative in aggregate.
Third, a quasipolynomial-time algorithm for computing approximate coarse correlated equilibria in one-step observable Partially Observable Markov Games (POMGs).
Thus we show that observability gives a blueprint for circumventing computational intractability in a variety of settings with partial observations, including planning, learning and computing equilibria.
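
To make the abstract’s central notion concrete: the Bayes filter is the posterior distribution over the latent state given the observations seen so far, and filter stability is the rate at which this posterior forgets a possibly incorrect initial belief. The following is a minimal illustrative sketch, not the paper’s algorithm: it simulates a small hypothetical 3-state HMM in numpy (the matrices T and O and the helper filter_update are invented for illustration), whose observation matrix is informative in the spirit of one-step observability, and it prints how the total-variation distance between two filters started from different priors shrinks as observations arrive.

# Illustrative sketch (assumed example, not from the paper): run the Bayes filter
# for a small hypothetical HMM from two different priors and watch the
# total-variation distance between the resulting beliefs contract over time.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-state HMM with an informative (roughly one-step observable) sensor.
T = np.array([[0.80, 0.15, 0.05],   # T[i, j] = P(next state j | current state i)
              [0.10, 0.80, 0.10],
              [0.05, 0.15, 0.80]])
O = np.array([[0.90, 0.05, 0.05],   # O[i, k] = P(observation k | state i)
              [0.05, 0.90, 0.05],
              [0.05, 0.05, 0.90]])

def filter_update(belief, obs):
    """One Bayes-filter step: propagate the belief through T, then condition on obs."""
    predicted = belief @ T               # predicted distribution over the next state
    posterior = predicted * O[:, obs]    # reweight by the observation likelihoods
    return posterior / posterior.sum()   # renormalize to a probability vector

state = rng.integers(3)                  # true (hidden) state of the chain
b_one = np.array([1.0, 0.0, 0.0])        # filter started from one prior
b_two = np.array([0.0, 0.0, 1.0])        # filter started from a very different prior

for t in range(20):
    state = rng.choice(3, p=T[state])    # latent state transitions
    obs = rng.choice(3, p=O[state])      # emitted observation
    b_one = filter_update(b_one, obs)
    b_two = filter_update(b_two, obs)
    tv = 0.5 * np.abs(b_one - b_two).sum()
    print(f"t={t:2d}  TV(b_one, b_two) = {tv:.4f}")

The paper’s contribution is a quantitative version of this phenomenon: under one-step observability the contraction happens at a provable rate, which underlies the quasipolynomial-time algorithms listed above. In this sketch the observed contraction rate simply reflects how informative the hypothetical O is.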


Cited By

  • (2024) Exploration is Harder than Prediction: Cryptographically Separating Reinforcement Learning from Supervised Learning. 2024 IEEE 65th Annual Symposium on Foundations of Computer Science (FOCS), 1953–1967. DOI: 10.1109/FOCS61266.2024.00117. Online publication date: 27 Oct 2024.

Published In

STOC 2023: Proceedings of the 55th Annual ACM Symposium on Theory of Computing
June 2023
1926 pages
ISBN: 9781450399135
DOI: 10.1145/3564246
This work is licensed under a Creative Commons Attribution 4.0 International License.

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Partially observable Markov decision processes
  2. filter stability

Conference

STOC '23

Acceptance Rates

Overall Acceptance Rate: 1,469 of 4,586 submissions, 32%

Article Metrics

  • Downloads (Last 12 months): 415
  • Downloads (Last 6 weeks): 44
Reflects downloads up to 15 Jan 2025

