Finite-Time Regret of Thompson Sampling Algorithms for Exponential Family Multi-Armed Bandits

Published: 28 November 2022

Abstract

We study the regret of Thompson sampling (TS) algorithms for exponential family bandits, where the reward distribution comes from a one-dimensional exponential family, which covers many common reward distributions including Bernoulli, Gaussian, Gamma, and Exponential. We propose a Thompson sampling algorithm, termed ExpTS, which uses a novel sampling distribution to avoid underestimation of the optimal arm. We provide a tight regret analysis for ExpTS that simultaneously yields both the finite-time and the asymptotic regret bounds. In particular, for a K-armed bandit with exponential family rewards, ExpTS over a horizon T is sub-UCB (a strong problem-dependent criterion for finite-time regret), minimax optimal up to a factor $\sqrt{\log K}$, and asymptotically optimal. Moreover, we propose ExpTS+, which adds a greedy exploitation step on top of the sampling distribution used in ExpTS to avoid overestimation of sub-optimal arms. ExpTS+ is an anytime bandit algorithm and achieves minimax optimality and asymptotic optimality simultaneously for exponential family reward distributions. Our proof techniques are general and conceptually simple, and can be easily applied to analyze standard Thompson sampling with specific reward distributions.
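
To make the procedure concrete, below is a minimal sketch of standard Thompson sampling for a K-armed Bernoulli bandit (Bernoulli being one member of the one-dimensional exponential family studied in the paper). It uses ordinary Beta posteriors rather than the ExpTS/ExpTS+ sampling distribution defined in the paper, and the function name, the true_means argument, and the horizon T are illustrative assumptions; it is shown only to illustrate the sample-an-index-then-pull-the-argmax loop that ExpTS builds on.

import numpy as np

# Sketch: standard Beta-Bernoulli Thompson sampling, NOT the ExpTS/ExpTS+
# sampling distribution from the paper. Purely illustrative.
def thompson_sampling_bernoulli(true_means, T, seed=None):
    rng = np.random.default_rng(seed)
    true_means = np.asarray(true_means, dtype=float)
    K = len(true_means)
    successes = np.zeros(K)   # per-arm success counts
    failures = np.zeros(K)    # per-arm failure counts
    best = true_means.max()
    regret = 0.0
    for _ in range(T):
        # Draw one index per arm from its Beta posterior; ExpTS replaces this
        # step with a sampling distribution designed to avoid underestimating
        # the optimal arm, and ExpTS+ additionally mixes in a greedy step.
        theta = rng.beta(1.0 + successes, 1.0 + failures)
        arm = int(np.argmax(theta))
        reward = float(rng.random() < true_means[arm])   # Bernoulli reward
        successes[arm] += reward
        failures[arm] += 1.0 - reward
        regret += best - true_means[arm]                 # pseudo-regret
    return regret

# Hypothetical usage: thompson_sampling_bernoulli([0.1, 0.5, 0.7], T=10_000, seed=0)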

Supplementary Material

Additional material (3600270.3603058_supp.pdf)
Supplemental material.

Published In

NIPS '22: Proceedings of the 36th International Conference on Neural Information Processing Systems
November 2022, 39114 pages

Publisher

Curran Associates Inc.
Red Hook, NY, United States

