DOI: 10.1609/aaai.v38i11.29193

Finite-Time Frequentist Regret Bounds of Multi-Agent Thompson Sampling on Sparse Hypergraphs

Published: 20 February 2024

Abstract

We study the multi-agent multi-armed bandit (MAMAB) problem, where agents are factored into overlapping groups. Each group represents a hyperedge, forming a hypergraph over the agents. At each round of interaction, the learner pulls a joint arm (composed of individual arms for each agent) and receives a reward according to the hypergraph structure. Specifically, we assume there is a local reward for each hyperedge, and the reward of the joint arm is the sum of these local rewards. Previous work introduced the multi-agent Thompson sampling (MATS) algorithm and derived a Bayesian regret bound; however, deriving a frequentist regret bound for Thompson sampling in this multi-agent setting has remained an open problem. To address this problem, we propose an efficient variant of MATS, the ε-exploring Multi-Agent Thompson Sampling (ε-MATS) algorithm, which performs MATS exploration with probability ε and adopts a greedy policy otherwise. We prove that ε-MATS achieves a worst-case frequentist regret bound that is sublinear in both the time horizon and the local arm size. We also derive a lower bound for this setting, which implies that our frequentist regret upper bound is optimal up to constant and logarithmic factors when the hypergraph is sufficiently sparse. Thorough experiments on standard MAMAB problems demonstrate the superior performance and improved computational efficiency of ε-MATS compared with existing algorithms in the same setting.
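The abstract describes ε-MATS as interleaving MATS-style posterior sampling (with probability ε) with a greedy step that uses posterior means otherwise, and choosing the joint arm that maximizes the sum of local estimates over hyperedges. The following is a minimal illustrative sketch of that idea, assuming Gaussian rewards and Gaussian posteriors on a toy hypergraph; the hyperedge layout, the names (hyperedges, epsilon, true_means), and the brute-force joint-arm search are assumptions made for the demo, not the paper's implementation, which would typically use an efficient coordination step such as variable elimination rather than enumeration.

import itertools
import numpy as np

rng = np.random.default_rng(0)

n_agents = 3
arms_per_agent = 2
hyperedges = [(0, 1), (1, 2)]   # overlapping agent groups (hypothetical toy hypergraph)
epsilon = 0.1                   # probability of a MATS-style exploration round
T = 1000

# Unknown local mean rewards, one per (hyperedge, local joint arm); demo only.
true_means = {e: rng.uniform(0, 1, size=(arms_per_agent,) * len(e)) for e in hyperedges}

# Sufficient statistics per (hyperedge, local joint arm).
sums = {e: np.zeros((arms_per_agent,) * len(e)) for e in hyperedges}
counts = {e: np.zeros((arms_per_agent,) * len(e)) for e in hyperedges}

for t in range(T):
    explore = rng.random() < epsilon
    estimates = {}
    for e in hyperedges:
        mean = sums[e] / np.maximum(counts[e], 1)
        if explore:
            # MATS-style step: draw local means from an (assumed) Gaussian posterior.
            estimates[e] = rng.normal(mean, 1.0 / np.sqrt(np.maximum(counts[e], 1)))
        else:
            # Greedy step: use posterior means directly.
            estimates[e] = mean

    # Choose the joint arm maximizing the sum of local estimates
    # (brute-force enumeration; only feasible for tiny problems).
    best_arm, best_val = None, -np.inf
    for joint in itertools.product(range(arms_per_agent), repeat=n_agents):
        val = sum(estimates[e][tuple(joint[i] for i in e)] for e in hyperedges)
        if val > best_val:
            best_arm, best_val = joint, val

    # Observe a noisy local reward for each hyperedge and update statistics.
    for e in hyperedges:
        local = tuple(best_arm[i] for i in e)
        reward = true_means[e][local] + 0.1 * rng.normal()
        sums[e][local] += reward
        counts[e][local] += 1

In this sketch, setting ε close to 0 yields a mostly greedy policy, while ε = 1 recovers MATS-style posterior sampling at every round.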

Published In

AAAI'24/IAAI'24/EAAI'24: Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence and Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence and Fourteenth Symposium on Educational Advances in Artificial Intelligence
February 2024, 23861 pages
ISBN: 978-1-57735-887-9

Sponsors

  • Association for the Advancement of Artificial Intelligence

Publisher

AAAI Press

