Finite-Time Regret of Thompson Sampling Algorithms for Exponential Family Multi-Armed Bandits

Published: 28 November 2022

Abstract

We study the regret of Thompson sampling (TS) algorithms for exponential family bandits, where the reward distribution comes from a one-dimensional exponential family, which covers many common reward distributions including Bernoulli, Gaussian, Gamma, and Exponential. We propose a Thompson sampling algorithm, termed ExpTS, which uses a novel sampling distribution to avoid underestimation of the optimal arm. We provide a tight regret analysis for ExpTS that simultaneously yields both the finite-time and the asymptotic regret bounds. In particular, for a K-armed bandit with exponential family rewards, ExpTS over a horizon T is sub-UCB (a strong problem-dependent criterion for finite-time regret), minimax optimal up to a factor $\sqrt{\log K}$, and asymptotically optimal. Moreover, we propose ExpTS+, which adds a greedy exploitation step on top of the sampling distribution used in ExpTS to avoid overestimation of sub-optimal arms. ExpTS+ is an anytime bandit algorithm and achieves minimax optimality and asymptotic optimality simultaneously for exponential family reward distributions. Our proof techniques are general and conceptually simple, and can be easily applied to analyze standard Thompson sampling with specific reward distributions.
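
To make the procedure concrete, below is a minimal sketch of standard Thompson sampling for a K-armed Bernoulli bandit (Bernoulli being one member of the one-dimensional exponential family studied in the paper). It uses ordinary Beta posteriors rather than the ExpTS/ExpTS+ sampling distribution defined in the paper, and the function name, the true_means argument, and the horizon T are illustrative assumptions; it is shown only to illustrate the sample-an-index-then-pull-the-argmax loop that ExpTS builds on.

import numpy as np

# Sketch: standard Beta-Bernoulli Thompson sampling, NOT the ExpTS/ExpTS+
# sampling distribution from the paper. Purely illustrative.
def thompson_sampling_bernoulli(true_means, T, seed=None):
    rng = np.random.default_rng(seed)
    true_means = np.asarray(true_means, dtype=float)
    K = len(true_means)
    successes = np.zeros(K)   # per-arm success counts
    failures = np.zeros(K)    # per-arm failure counts
    best = true_means.max()
    regret = 0.0
    for _ in range(T):
        # Draw one index per arm from its Beta posterior; ExpTS replaces this
        # step with a sampling distribution designed to avoid underestimating
        # the optimal arm, and ExpTS+ additionally mixes in a greedy step.
        theta = rng.beta(1.0 + successes, 1.0 + failures)
        arm = int(np.argmax(theta))
        reward = float(rng.random() < true_means[arm])   # Bernoulli reward
        successes[arm] += reward
        failures[arm] += 1.0 - reward
        regret += best - true_means[arm]                 # pseudo-regret
    return regret

# Hypothetical usage: thompson_sampling_bernoulli([0.1, 0.5, 0.7], T=10_000, seed=0)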

Supplementary Material

Additional material (3600270.3603058_supp.pdf)
Supplemental material.

Published In

NIPS '22: Proceedings of the 36th International Conference on Neural Information Processing Systems
November 2022, 39114 pages

Publisher

Curran Associates Inc.
Red Hook, NY, United States

