DOI: 10.1145/3583780.3614990

Non-Compliant Bandits

Published: 21 October 2023

Abstract

Bandit algorithms have become a standard approach to learning better models online. As they grow more popular, they are increasingly deployed in complex machine learning pipelines, where their actions can be overwritten. For example, in ranking problems, a list of recommended items can be modified by a downstream algorithm to increase diversity. This may break classic bandit algorithms and lead to linear regret. Specifically, if the proposed action is not taken, the uncertainty in its estimated mean reward may not be reduced. In this work, we study this setting, which we call non-compliant bandits, since the agent tries to learn rewarding actions that comply with a downstream task. We propose two algorithms, compliant contextual UCB (CompUCB) and Thompson sampling (CompTS), which learn separate reward and compliance models. The compliance model allows the agent to avoid non-compliant actions. We derive a sublinear regret bound for CompUCB. We also conduct experiments that compare our algorithms to classic bandit baselines. The experiments show failures of the baselines and that our algorithms mitigate them by learning compliance models.
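The abstract does not spell out how the reward and compliance models are combined inside CompUCB, so the following is only a minimal sketch of the idea under stated assumptions: both the reward and the compliance of a proposed action are modeled with linear (ridge-regression) estimators, and the scoring rule assumed here, which is not necessarily the paper's, multiplies the optimistic reward estimate by a clipped optimistic compliance estimate so that likely non-compliant actions are down-weighted. The class name CompUCBSketch and all parameters are illustrative.

```python
import numpy as np

class CompUCBSketch:
    """Illustrative compliance-aware contextual UCB (not the paper's exact algorithm)."""

    def __init__(self, d, alpha=1.0, lam=1.0):
        self.alpha = alpha  # confidence-interval width (assumed constant)
        # Separate ridge-regression statistics for the reward model ...
        self.A_r, self.b_r = lam * np.eye(d), np.zeros(d)
        # ... and for the compliance model (was the proposed action actually taken?).
        self.A_c, self.b_c = lam * np.eye(d), np.zeros(d)

    def _ucb(self, A, b, x):
        # Optimistic estimate: point estimate plus an exploration bonus.
        A_inv = np.linalg.inv(A)
        return float(x @ (A_inv @ b) + self.alpha * np.sqrt(x @ A_inv @ x))

    def choose(self, arm_features):
        # Assumed combination rule: optimistic reward times optimistic compliance,
        # with the compliance score clipped to [0, 1].
        scores = [self._ucb(self.A_r, self.b_r, x)
                  * np.clip(self._ucb(self.A_c, self.b_c, x), 0.0, 1.0)
                  for x in arm_features]
        return int(np.argmax(scores))

    def update(self, x, complied, reward=0.0):
        # Compliance feedback is observed on every round for the proposed action.
        self.A_c += np.outer(x, x)
        self.b_c += float(complied) * x
        # The reward model is updated only when the action was actually taken;
        # this is the feedback pattern that breaks classic bandits.
        if complied:
            self.A_r += np.outer(x, x)
            self.b_r += reward * x
```

In a simulation loop, the agent would call choose on the current arm features, observe whether the downstream system kept the proposed action and, if so, the realized reward, and then call update. Because the reward statistics change only on compliant rounds, repeatedly proposing non-compliant actions would leave their reward uncertainty unreduced, which is the failure mode described above.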

Supplementary Material

MP4 File (3583780.3614990-video.mp4)
Presentation video


Published In

CIKM '23: Proceedings of the 32nd ACM International Conference on Information and Knowledge Management
October 2023
5508 pages
ISBN:9798400701245
DOI:10.1145/3583780
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. active learning
  2. bandits
  3. exploration
  4. online learning
  5. reinforcement learning

Qualifiers

  • Research-article

Conference

CIKM '23

Acceptance Rates

Overall acceptance rate: 1,861 of 8,427 submissions (22%)
