Interactive POMDPs with finite-state models of other agents

Published: 01 July 2017

Abstract

We consider an autonomous agent facing a stochastic, partially observable, multiagent environment. In order to compute an optimal plan, the agent must accurately predict the actions of the other agents, since they influence the state of the environment and ultimately the agent's utility. To do so, we propose a special case of the interactive partially observable Markov decision process (I-POMDP), in which the agent does not explicitly model the other agents' beliefs and preferences, and instead represents them as stochastic processes implemented by probabilistic deterministic finite state controllers (PDFCs). The agent maintains a probability distribution over the PDFC models of the other agents, and updates this belief using Bayesian inference. Since the number of nodes of these PDFCs is unknown and unbounded, the agent places a Bayesian nonparametric prior distribution over the infinite-dimensional set of PDFCs. This allows the size of the learned models to adapt to the complexity of the observed behavior. Because the posterior distribution in this case is too complex to compute analytically, we provide a Markov chain Monte Carlo algorithm that approximates the posterior beliefs over the other agents' PDFCs, given a sequence of (possibly imperfect) observations of their behavior. Experimental results show that the learned models converge behaviorally to the true ones. We consider two settings, one in which the agent first learns and then interacts with other agents, and one in which learning and planning are interleaved. We show that the agent's performance improves as a result of learning in both situations. Moreover, we analyze the dynamics that ensue when two agents simultaneously learn about each other while interacting, showing in an example environment that coordination emerges naturally from our approach. Furthermore, we demonstrate how an agent can exploit the learned models to perform indirect inference over the state of the environment via the modeled agent's actions.
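
To make the representation concrete, the following is a minimal Python sketch of a PDFC under one plausible formalization: each node emits actions stochastically, and the next node is a deterministic function of the current node, the emitted action, and the received observation. The class, method, and toy-domain names are illustrative assumptions, not the paper's notation.

```python
import random

# Minimal sketch of a probabilistic deterministic finite state controller (PDFC),
# assuming: each node emits an action stochastically, and the next node is a
# deterministic function of (current node, emitted action, received observation).
# All names and the toy domain below are illustrative, not taken from the paper.

class PDFC:
    def __init__(self, action_probs, transitions, start_node=0):
        self.action_probs = action_probs    # node -> {action: probability}
        self.transitions = transitions      # (node, action, observation) -> next node
        self.node = start_node

    def act(self):
        """Sample an action from the current node's emission distribution."""
        actions, probs = zip(*self.action_probs[self.node].items())
        return random.choices(actions, weights=probs, k=1)[0]

    def update(self, action, observation):
        """Deterministically move to the next node."""
        self.node = self.transitions[(self.node, action, observation)]


# Toy two-node controller: mostly "listen" in node 0, always "open" in node 1.
other_agent = PDFC(
    action_probs={0: {"listen": 0.9, "open": 0.1},
                  1: {"open": 1.0}},
    transitions={(0, "listen", "hear-left"): 1,
                 (0, "listen", "hear-right"): 0,
                 (0, "open", "hear-left"): 0,
                 (0, "open", "hear-right"): 0,
                 (1, "open", "hear-left"): 0,
                 (1, "open", "hear-right"): 0},
)

# A modeling agent can simulate such a controller to predict the other agent's
# next action while planning; inferring which PDFC (including its number of
# nodes) generated the observed behavior is what the paper's Bayesian
# nonparametric prior and MCMC procedure address.
a = other_agent.act()
other_agent.update(a, "hear-left")
```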


    Published In

    Autonomous Agents and Multi-Agent Systems, Volume 31, Issue 4, July 2017, 176 pages

    Publisher

    Kluwer Academic Publishers, United States

    Author Tags

    1. Multiagent systems
    2. Opponent modeling
    3. Stochastic planning
