
The multi-armed bandit, with constraints

Annals of Operations Research

Abstract

Presented in this paper is a self-contained analysis of a Markov decision problem that is known as the multi-armed bandit. The analysis covers the cases of linear and exponential utility functions. The optimal policy is shown to have a simple and easily-implemented form. Procedures for computing such a policy are presented, as are procedures for computing the expected utility that it earns, given any starting state. For the case of linear utility, constraints that link the bandits are introduced, and the constrained optimization problem is solved via column generation. The methodology is novel in several respects, which include the use of elementary row operations to simplify arguments.
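
As a rough illustration of what "computing such a policy" can look like, the sketch below computes a priority index for every state of a single bandit. It is not the procedure of this paper, which the abstract does not spell out. It assumes the classical discounted, linear-utility setting and uses the well-known restart-in-state characterization of the Gittins index, solved by plain value iteration. The names `gittins_indices`, `P` (transition matrix), `r` (one-step rewards) and `beta` (discount factor) are illustrative assumptions.

```python
def gittins_indices(P, r, beta, tol=1e-10, max_iter=100_000):
    """Gittins index of each state of one arm (discounted case, assumed).

    For each state i, value iteration solves the restart problem
        V(j) = max(r[j] + beta * sum_k P[j][k] * V(k),
                   r[i] + beta * sum_k P[i][k] * V(k)),
    and the index of state i is (1 - beta) * V(i).
    """
    n = len(r)
    indices = []
    for i in range(n):
        V = [0.0] * n
        for _ in range(max_iter):
            # Value of continuing from each state j.
            cont = [
                r[j] + beta * sum(P[j][k] * V[k] for k in range(n))
                for j in range(n)
            ]
            # In every state, the alternative is to restart in state i.
            V_new = [max(c, cont[i]) for c in cont]
            gap = max(abs(a - b) for a, b in zip(V_new, V))
            V = V_new
            if gap < tol:
                break
        indices.append((1.0 - beta) * V[i])
    return indices


# Hypothetical two-state arm: state 0 pays nothing and moves to
# state 1, which is absorbing and pays 1 per period.
print(gittins_indices([[0.0, 1.0], [0.0, 1.0]], [0.0, 1.0], 0.9))
```

Under these assumptions, the resulting rule is a priority rule: at each epoch, play the bandit whose current state has the largest index. Value iteration is chosen here only for brevity; it is slower than pivoting-based methods.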



Acknowledgements

The authors are pleased to acknowledge that this paper has benefited immensely from the reactions of Dr. Pelin Cambolat to earlier drafts. This paper has also been improved markedly by two rounds of very careful, thoughtful and constructive refereeing. The research of the second author has been supported in part by NSF grant CMMI-0928490. The research of the third author has been supported in part by ISF Israel Science Foundation grant 901/10.

Author information

Corresponding author

Correspondence to Eugene A. Feinberg.

About this article

Cite this article

Denardo, E.V., Feinberg, E.A. & Rothblum, U.G. The multi-armed bandit, with constraints. Ann Oper Res 208, 37–62 (2013). https://doi.org/10.1007/s10479-012-1250-y

  • DOI: https://doi.org/10.1007/s10479-012-1250-y

Keywords