Interactive POMDPs with finite-state models of other agents

Published: 01 July 2017

Abstract

We consider an autonomous agent facing a stochastic, partially observable, multiagent environment. In order to compute an optimal plan, the agent must accurately predict the actions of the other agents, since they influence the state of the environment and ultimately the agent's utility. To do so, we propose a special case of the interactive partially observable Markov decision process (I-POMDP), in which the agent does not explicitly model the other agents' beliefs and preferences, and instead represents them as stochastic processes implemented by probabilistic deterministic finite state controllers (PDFCs). The agent maintains a probability distribution over the PDFC models of the other agents, and updates this belief using Bayesian inference. Since the number of nodes of these PDFCs is unknown and unbounded, the agent places a Bayesian nonparametric prior distribution over the infinite-dimensional set of PDFCs. This allows the size of the learned models to adapt to the complexity of the observed behavior. Because the posterior distribution in this case is too complex to compute analytically, we provide a Markov chain Monte Carlo algorithm that approximates the posterior beliefs over the other agents' PDFCs, given a sequence of (possibly imperfect) observations of their behavior. Experimental results show that the learned models converge behaviorally to the true ones. We consider two settings, one in which the agent first learns and then interacts with other agents, and one in which learning and planning are interleaved. We show that the agent's performance improves as a result of learning in both situations. Moreover, we analyze the dynamics that ensue when two agents simultaneously learn about each other while interacting, showing in an example environment that coordination emerges naturally from our approach. Furthermore, we demonstrate how an agent can exploit the learned models to perform indirect inference over the state of the environment via the modeled agent's actions.
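
To make the representation concrete, the following is a minimal Python sketch of a PDFC under one plausible formalization: each node emits actions stochastically, and the next node is a deterministic function of the current node, the emitted action, and the received observation. The class, method, and toy-domain names are illustrative assumptions, not the paper's notation.

```python
import random

# Minimal sketch of a probabilistic deterministic finite state controller (PDFC),
# assuming: each node emits an action stochastically, and the next node is a
# deterministic function of (current node, emitted action, received observation).
# All names and the toy domain below are illustrative, not taken from the paper.

class PDFC:
    def __init__(self, action_probs, transitions, start_node=0):
        self.action_probs = action_probs    # node -> {action: probability}
        self.transitions = transitions      # (node, action, observation) -> next node
        self.node = start_node

    def act(self):
        """Sample an action from the current node's emission distribution."""
        actions, probs = zip(*self.action_probs[self.node].items())
        return random.choices(actions, weights=probs, k=1)[0]

    def update(self, action, observation):
        """Deterministically move to the next node."""
        self.node = self.transitions[(self.node, action, observation)]


# Toy two-node controller: mostly "listen" in node 0, always "open" in node 1.
other_agent = PDFC(
    action_probs={0: {"listen": 0.9, "open": 0.1},
                  1: {"open": 1.0}},
    transitions={(0, "listen", "hear-left"): 1,
                 (0, "listen", "hear-right"): 0,
                 (0, "open", "hear-left"): 0,
                 (0, "open", "hear-right"): 0,
                 (1, "open", "hear-left"): 0,
                 (1, "open", "hear-right"): 0},
)

# A modeling agent can simulate such a controller to predict the other agent's
# next action while planning; inferring which PDFC (including its number of
# nodes) generated the observed behavior is what the paper's Bayesian
# nonparametric prior and MCMC procedure address.
a = other_agent.act()
other_agent.update(a, "hear-left")
```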


    Published In

    Autonomous Agents and Multi-Agent Systems, Volume 31, Issue 4, July 2017, 176 pages

    Publisher

    Kluwer Academic Publishers, United States

    Author Tags

    1. Multiagent systems
    2. Opponent modeling
    3. Stochastic planning
