Abstract
This paper presents a self-contained analysis of the Markov decision problem known as the multi-armed bandit. The analysis covers the cases of linear and exponential utility functions. The optimal policy is shown to have a simple, easily implemented form. Procedures are presented for computing such a policy and for computing the expected utility that it earns from any starting state. For the case of linear utility, constraints that link the bandits are introduced, and the constrained optimization problem is solved via column generation. The methodology is novel in several respects, including the use of elementary row operations to simplify arguments.
Acknowledgements
The authors are pleased to acknowledge that this paper has benefited immensely from the reactions of Dr. Pelin Cambolat to earlier drafts. This paper has also been improved markedly by two rounds of very careful, thoughtful and constructive refereeing. The research of the second author has been supported in part by NSF grant CMMI-0928490. The research of the third author has been supported in part by ISF Israel Science Foundation grant 901/10.
Cite this article
Denardo, E.V., Feinberg, E.A. & Rothblum, U.G. The multi-armed bandit, with constraints. Ann Oper Res 208, 37–62 (2013). https://doi.org/10.1007/s10479-012-1250-y
DOI: https://doi.org/10.1007/s10479-012-1250-y