DOI: 10.1609/aaai.v38i11.29193

Finite-Time Frequentist Regret Bounds of Multi-Agent Thompson Sampling on Sparse Hypergraphs

Published: 20 February 2024

Abstract

We study the multi-agent multi-armed bandit (MAMAB) problem, where agents are factored into overlapping groups. Each group represents a hyperedge, forming a hypergraph over the agents. At each round of interaction, the learner pulls a joint arm (composed of individual arms for each agent) and receives a reward according to the hypergraph structure. Specifically, we assume there is a local reward for each hyperedge, and the reward of the joint arm is the sum of these local rewards. Previous work introduced the multi-agent Thompson sampling (MATS) algorithm and derived a Bayesian regret bound; however, deriving a frequentist regret bound for Thompson sampling in this multi-agent setting has remained an open problem. To address this problem, we propose an efficient variant of MATS, the ε-exploring Multi-Agent Thompson Sampling (ε-MATS) algorithm, which performs MATS exploration with probability ε and adopts a greedy policy otherwise. We prove that ε-MATS achieves a worst-case frequentist regret bound that is sublinear in both the time horizon and the local arm size. We also derive a lower bound for this setting, which implies that our frequentist regret upper bound is optimal up to constant and logarithmic factors when the hypergraph is sufficiently sparse. Thorough experiments on standard MAMAB problems demonstrate the superior performance and improved computational efficiency of ε-MATS compared with existing algorithms in the same setting.
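The abstract describes ε-MATS as interleaving MATS-style posterior sampling (with probability ε) with a greedy step that uses posterior means otherwise, and choosing the joint arm that maximizes the sum of local estimates over hyperedges. The following is a minimal illustrative sketch of that idea, assuming Gaussian rewards and Gaussian posteriors on a toy hypergraph; the hyperedge layout, the names (hyperedges, epsilon, true_means), and the brute-force joint-arm search are assumptions made for the demo, not the paper's implementation, which would typically use an efficient coordination step such as variable elimination rather than enumeration.

import itertools
import numpy as np

rng = np.random.default_rng(0)

n_agents = 3
arms_per_agent = 2
hyperedges = [(0, 1), (1, 2)]   # overlapping agent groups (hypothetical toy hypergraph)
epsilon = 0.1                   # probability of a MATS-style exploration round
T = 1000

# Unknown local mean rewards, one per (hyperedge, local joint arm); demo only.
true_means = {e: rng.uniform(0, 1, size=(arms_per_agent,) * len(e)) for e in hyperedges}

# Sufficient statistics per (hyperedge, local joint arm).
sums = {e: np.zeros((arms_per_agent,) * len(e)) for e in hyperedges}
counts = {e: np.zeros((arms_per_agent,) * len(e)) for e in hyperedges}

for t in range(T):
    explore = rng.random() < epsilon
    estimates = {}
    for e in hyperedges:
        mean = sums[e] / np.maximum(counts[e], 1)
        if explore:
            # MATS-style step: draw local means from an (assumed) Gaussian posterior.
            estimates[e] = rng.normal(mean, 1.0 / np.sqrt(np.maximum(counts[e], 1)))
        else:
            # Greedy step: use posterior means directly.
            estimates[e] = mean

    # Choose the joint arm maximizing the sum of local estimates
    # (brute-force enumeration; only feasible for tiny problems).
    best_arm, best_val = None, -np.inf
    for joint in itertools.product(range(arms_per_agent), repeat=n_agents):
        val = sum(estimates[e][tuple(joint[i] for i in e)] for e in hyperedges)
        if val > best_val:
            best_arm, best_val = joint, val

    # Observe a noisy local reward for each hyperedge and update statistics.
    for e in hyperedges:
        local = tuple(best_arm[i] for i in e)
        reward = true_means[e][local] + 0.1 * rng.normal()
        sums[e][local] += reward
        counts[e][local] += 1

In this sketch, setting ε close to 0 yields a mostly greedy policy, while ε = 1 recovers MATS-style posterior sampling at every round.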

Published In

AAAI'24/IAAI'24/EAAI'24: Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence and Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence and Fourteenth Symposium on Educational Advances in Artificial Intelligence
February 2024, 23861 pages
ISBN: 978-1-57735-887-9

Sponsors

  • Association for the Advancement of Artificial Intelligence

Publisher

AAAI Press

