
Finite-time Analysis of the Multiarmed Bandit Problem

Published: 01 May 2002

Abstract

Reinforcement learning policies face the exploration versus exploitation dilemma, i.e., the search for a balance between exploring the environment to find profitable actions and taking the empirically best action as often as possible. A popular measure of a policy's success in addressing this dilemma is the regret, that is, the loss due to the fact that the globally optimal policy is not followed at all times. One of the simplest examples of the exploration/exploitation dilemma is the multi-armed bandit problem. Lai and Robbins were the first to show that the regret for this problem has to grow at least logarithmically in the number of plays. Since then, policies that asymptotically achieve this regret have been devised by Lai and Robbins and many others. In this work we show that the optimal logarithmic regret is also achievable uniformly over time, with simple and efficient policies, and for all reward distributions with bounded support.
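
For K arms with expected rewards μ_1, ..., μ_K and μ* = max_j μ_j, the regret after n plays can be written as μ*·n − Σ_j μ_j·E[T_j(n)], where T_j(n) is the number of times arm j has been played. As a concrete illustration, the following is a minimal Python sketch of an upper-confidence-bound index policy of the kind the abstract refers to: after playing each arm once, it always pulls the arm maximizing its sample mean plus a confidence radius sqrt(2 ln n / n_j). The Bernoulli reward setup, the name ucb_play, and all other identifiers are illustrative assumptions, not taken from the paper.

import math
import random

def ucb_play(arms, horizon):
    # Each element of `arms` is a zero-argument callable returning a
    # reward in [0, 1]; bounded support is the only assumption used.
    counts = [0] * len(arms)    # n_j: number of plays of arm j so far
    sums = [0.0] * len(arms)    # cumulative reward collected from arm j
    for t in range(1, horizon + 1):
        if t <= len(arms):
            j = t - 1           # initialization: play each arm once
        else:
            # Index of arm a: sample mean + sqrt(2 ln t / n_a).
            j = max(range(len(arms)),
                    key=lambda a: sums[a] / counts[a]
                    + math.sqrt(2.0 * math.log(t) / counts[a]))
        counts[j] += 1
        sums[j] += arms[j]()
    return counts

# Illustrative use: two Bernoulli arms with means 0.5 and 0.6; the
# suboptimal arm's play count grows only logarithmically in the horizon.
random.seed(0)
arms = [lambda: float(random.random() < 0.5),
        lambda: float(random.random() < 0.6)]
print(ucb_play(arms, 10000))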

References

[1] Agrawal, R. (1995). Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, 27, 1054-1078.
[2] Berry, D., & Fristedt, B. (1985). Bandit problems. London: Chapman and Hall.
[3] Burnetas, A., & Katehakis, M. (1996). Optimal adaptive policies for sequential allocation problems. Advances in Applied Mathematics, 17(2), 122-142.
[4] Duff, M. (1995). Q-learning for bandit problems. In Proceedings of the 12th International Conference on Machine Learning (pp. 209-217).
[5] Gittins, J. (1989). Multi-armed bandit allocation indices. Wiley-Interscience Series in Systems and Optimization. New York: John Wiley and Sons.
[6] Holland, J. (1992). Adaptation in natural and artificial systems. Cambridge: MIT Press/Bradford Books.
[7] Ishikida, T., & Varaiya, P. (1994). Multi-armed bandit problem revisited. Journal of Optimization Theory and Applications, 83(1), 113-154.
[8] Lai, T., & Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6, 4-22.
[9] Pollard, D. (1984). Convergence of stochastic processes. Berlin: Springer.
[10] Sutton, R., & Barto, A. (1998). Reinforcement learning: An introduction. Cambridge: MIT Press/Bradford Books.
[11] Wilks, S. (1962). Mathematical statistics. New York: John Wiley and Sons.
[12] Yakowitz, S., & Lowe, W. (1991). Nonparametric bandit methods. Annals of Operations Research, 28, 297-312.


Published In

Machine Learning, Volume 47, Issue 2-3
May-June 2002
163 pages

Publisher

Kluwer Academic Publishers

United States


Author Tags

  1. adaptive allocation rules
  2. bandit problems
  3. finite horizon regret



Cited By

  • (2024) Fairness and Privacy Guarantees in Federated Contextual Bandits. Proceedings of the 23rd International Conference on Autonomous Agents and Multiagent Systems, 2471-2473. DOI: 10.5555/3635637.3663197. Online publication date: 6-May-2024.
  • (2024) LgTS: Dynamic Task Sampling using LLM-generated Sub-Goals for Reinforcement Learning Agents. Proceedings of the 23rd International Conference on Autonomous Agents and Multiagent Systems, 1736-1744. DOI: 10.5555/3635637.3663035. Online publication date: 6-May-2024.
  • (2024) Improving Mobile Maternal and Child Health Care Programs: Collaborative Bandits for Time Slot Selection. Proceedings of the 23rd International Conference on Autonomous Agents and Multiagent Systems, 1540-1548. DOI: 10.5555/3635637.3663014. Online publication date: 6-May-2024.
  • (2024) Observer-Aware Planning with Implicit and Explicit Communication. Proceedings of the 23rd International Conference on Autonomous Agents and Multiagent Systems, 1409-1417. DOI: 10.5555/3635637.3663000. Online publication date: 6-May-2024.
  • (2024) Lifetime policy reuse and the importance of task capacity. AI Communications, 37(1), 115-148. DOI: 10.3233/AIC-230040. Online publication date: 21-Mar-2024.
  • (2024) Online Learning and Pricing for Service Systems with Reusable Resources. Operations Research, 72(3), 1203-1241. DOI: 10.1287/opre.2022.2381. Online publication date: 1-May-2024.
  • (2024) Efficient Decentralized Multi-agent Learning in Asymmetric Bipartite Queueing Systems. Operations Research, 72(3), 1049-1070. DOI: 10.1287/opre.2022.0291. Online publication date: 1-May-2024.
  • (2024) UCB-Type Learning Algorithms with Kaplan–Meier Estimator for Lost-Sales Inventory Models with Lead Times. Operations Research, 72(4), 1317-1332. DOI: 10.1287/opre.2022.0273. Online publication date: 29-Feb-2024.
  • (2024) Multiobjective Stochastic Optimization. Manufacturing & Service Operations Management, 26(2), 500-518. DOI: 10.1287/msom.2020.0247. Online publication date: 1-Mar-2024.
  • (2024) Distribution-Free Contextual Dynamic Pricing. Mathematics of Operations Research, 49(1), 599-618. DOI: 10.1287/moor.2023.1369. Online publication date: 1-Feb-2024.
