DOI: 10.5555/777092.777126
Article

Reinforcement learning for POMDPs based on action values and stochastic optimization

Published: 28 July 2002

Abstract

We present a new, model-free reinforcement learning algorithm for learning to control partially-observable Markov decision processes. The algorithm incorporates ideas from action-value based reinforcement learning approaches, such as Q-Learning, as well as ideas from the stochastic optimization literature. Key to our approach is a new definition of action value, which makes the algorithm theoretically sound for partially-observable settings. We show that special cases of our algorithm can achieve probability one convergence to locally optimal policies in the limit, or probably approximately correct hill-climbing to a locally optimal policy in a finite number of samples.
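The paper's own action-value definition is not reproduced on this page. As a rough illustration of the kind of action-value learning the algorithm builds on, here is a minimal memoryless Q-Learning baseline that keys its table on observations rather than latent states. This is a sketch under stated assumptions, not the authors' algorithm; the function names and the toy corridor environment are invented for the example:

```python
import random

class Corridor:
    """Toy 3-cell corridor: cells 0, 1, 2; observation = cell index.
    Actions: 0 = left, 1 = right. Reaching cell 2 ends the episode
    with reward +1. (Fully observable here, so plain Q-learning works;
    in a true POMDP, distinct latent states may share an observation.)"""
    def reset(self):
        self.s = 0
        return self.s

    def step(self, a):
        self.s = min(2, self.s + 1) if a == 1 else max(0, self.s - 1)
        done = self.s == 2
        return self.s, (1.0 if done else 0.0), done

def q_learning_memoryless(env_step, env_reset, n_obs, n_actions,
                          episodes=500, alpha=0.1, gamma=0.95,
                          epsilon=0.1, seed=0):
    """Tabular Q-learning keyed on observations (a memoryless policy)."""
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_obs)]
    for _ in range(episodes):
        obs = env_reset()
        done = False
        while not done:
            # epsilon-greedy action selection with random tie-breaking
            if rng.random() < epsilon:
                a = rng.randrange(n_actions)
            else:
                best = max(Q[obs])
                a = rng.choice([i for i, q in enumerate(Q[obs]) if q == best])
            obs2, r, done = env_step(a)
            # standard one-step TD backup on the observed transition
            target = r if done else r + gamma * max(Q[obs2])
            Q[obs][a] += alpha * (target - Q[obs][a])
            obs = obs2
    return Q

env = Corridor()
Q = q_learning_memoryless(env.step, env.reset, n_obs=3, n_actions=2)
```

On this toy problem the learned greedy policy moves right from every cell. When observations alias distinct latent states, however, this kind of observation-keyed update is no longer theoretically sound, which is the gap the paper's redefined action values are meant to address.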



Published In

Eighteenth National Conference on Artificial Intelligence
July 2002
1068 pages
ISBN:0262511290

Sponsors

  • NSF: National Science Foundation
  • Alberta Informatics Circle of Research Excellence (iCORE)
  • SIGAI: ACM Special Interest Group on Artificial Intelligence
  • Naval Research Laboratory
  • AAAI: American Association for Artificial Intelligence
  • NASA Ames Research Center
  • DARPA: Defense Advanced Research Projects Agency

Publisher

American Association for Artificial Intelligence

United States



Cited By

  • (2017) Can bounded and self-interested agents be teammates? Application to planning in ad hoc teams. Autonomous Agents and Multi-Agent Systems 31(4), 821-860. DOI: 10.1007/s10458-016-9354-4. Online publication date: 1-Jul-2017.
  • (2016) Learning to Act Optimally in Partially Observable Multiagent Settings. Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems, 1532-1533. DOI: 10.5555/2936924.2937241. Online publication date: 9-May-2016.
  • (2016) Reinforcement Learning in Partially Observable Multiagent Settings. Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems, 530-538. DOI: 10.5555/2936924.2937002. Online publication date: 9-May-2016.
  • (2014) Team behavior in interactive dynamic influence diagrams with applications to ad hoc teams. Proceedings of the 2014 International Conference on Autonomous Agents and Multi-agent Systems, 1559-1560. DOI: 10.5555/2615731.2616061. Online publication date: 5-May-2014.
  • (2012) Induction and learning of finite-state controllers from simulation. Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems - Volume 3, 1203-1204. DOI: 10.5555/2343896.2343922. Online publication date: 4-Jun-2012.
  • (2011) Reinforcement learning through global stochastic search in N-MDPs. Proceedings of the 2011 European Conference on Machine Learning and Knowledge Discovery in Databases - Volume Part II, 326-340. DOI: 10.5555/2034117.2034139. Online publication date: 5-Sep-2011.
  • (2011) LearnPNP. RoboCup 2010, 418-429. DOI: 10.5555/1983806.1983843. Online publication date: 1-Jan-2011.
  • (2011) Reinforcement learning through global stochastic search in N-MDPs. Proceedings of the 2011 European Conference on Machine Learning and Knowledge Discovery in Databases - Volume Part II, 326-340. DOI: 10.1007/978-3-642-23783-6_21. Online publication date: 5-Sep-2011.
  • (2010) Improving the performance of complex agent plans through reinforcement learning. Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems - Volume 1, 723-730. DOI: 10.5555/1838206.1838302. Online publication date: 10-May-2010.
  • (2002) The thing that we tried didn't work very well. Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence, 154-161. DOI: 10.5555/2073876.2073895. Online publication date: 1-Aug-2002.
