DOI: 10.1145/3583780.3614990

Non-Compliant Bandits

Published: 21 October 2023

Abstract

Bandit algorithms have become a standard approach to learning better models online. As they grow more popular, they are increasingly deployed in complex machine learning pipelines, where their actions can be overwritten. For example, in ranking problems, a list of recommended items can be modified by a downstream algorithm to increase diversity. This may break classic bandit algorithms and lead to linear regret. Specifically, if the proposed action is not taken, the uncertainty in its estimated mean reward may not be reduced. In this work, we study this setting, which we call non-compliant bandits, since the agent tries to learn rewarding actions that comply with a downstream task. We propose two algorithms, compliant contextual UCB (CompUCB) and Thompson sampling (CompTS), which learn separate reward and compliance models. The compliance model allows the agent to avoid non-compliant actions. We derive a sublinear regret bound for CompUCB. We also conduct experiments that compare our algorithms to classic bandit baselines. The experiments show failures of the baselines and that our algorithms mitigate them by learning compliance models.
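The abstract does not spell out how the reward and compliance models are combined inside CompUCB, so the following is only a minimal sketch of the idea under stated assumptions: both the reward and the compliance of a proposed action are modeled with linear (ridge-regression) estimators, and the scoring rule assumed here, which is not necessarily the paper's, multiplies the optimistic reward estimate by a clipped optimistic compliance estimate so that likely non-compliant actions are down-weighted. The class name CompUCBSketch and all parameters are illustrative.

```python
import numpy as np

class CompUCBSketch:
    """Illustrative compliance-aware contextual UCB (not the paper's exact algorithm)."""

    def __init__(self, d, alpha=1.0, lam=1.0):
        self.alpha = alpha  # confidence-interval width (assumed constant)
        # Separate ridge-regression statistics for the reward model ...
        self.A_r, self.b_r = lam * np.eye(d), np.zeros(d)
        # ... and for the compliance model (was the proposed action actually taken?).
        self.A_c, self.b_c = lam * np.eye(d), np.zeros(d)

    def _ucb(self, A, b, x):
        # Optimistic estimate: point estimate plus an exploration bonus.
        A_inv = np.linalg.inv(A)
        return float(x @ (A_inv @ b) + self.alpha * np.sqrt(x @ A_inv @ x))

    def choose(self, arm_features):
        # Assumed combination rule: optimistic reward times optimistic compliance,
        # with the compliance score clipped to [0, 1].
        scores = [self._ucb(self.A_r, self.b_r, x)
                  * np.clip(self._ucb(self.A_c, self.b_c, x), 0.0, 1.0)
                  for x in arm_features]
        return int(np.argmax(scores))

    def update(self, x, complied, reward=0.0):
        # Compliance feedback is observed on every round for the proposed action.
        self.A_c += np.outer(x, x)
        self.b_c += float(complied) * x
        # The reward model is updated only when the action was actually taken;
        # this is the feedback pattern that breaks classic bandits.
        if complied:
            self.A_r += np.outer(x, x)
            self.b_r += reward * x
```

In a simulation loop, the agent would call choose on the current arm features, observe whether the downstream system kept the proposed action and, if so, the realized reward, and then call update. Because the reward statistics change only on compliant rounds, repeatedly proposing non-compliant actions would leave their reward uncertainty unreduced, which is the failure mode described above.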

Supplementary Material

MP4 File (3583780.3614990-video.mp4)
Presentation video


Published In

CIKM '23: Proceedings of the 32nd ACM International Conference on Information and Knowledge Management
October 2023
5508 pages
ISBN:9798400701245
DOI:10.1145/3583780
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. active learning
  2. bandits
  3. exploration
  4. online learning
  5. reinforcement learning

Qualifiers

  • Research-article

Conference

CIKM '23

Acceptance Rates

Overall acceptance rate: 1,861 of 8,427 submissions (22%)
