DOI: 10.5555/3692070.3693846

Incentivized learning in principal-agent bandit games

Published: 21 July 2024

Abstract

This work considers a repeated principal-agent bandit game in which the principal can interact with her environment only through the agent. The principal and the agent have misaligned objectives, and the choice of action is left to the agent alone. However, the principal can influence the agent's decisions by offering incentives that are added to his rewards. The principal aims to iteratively learn an incentive policy that maximizes her own total utility. This framework extends the usual bandit problem and is motivated by several practical applications, such as healthcare or ecological taxation, where traditional mechanism design theories often overlook the learning aspect of the problem. We present nearly optimal (with respect to the horizon T) learning algorithms for the principal's regret in both the multi-armed and the linear contextual settings. Finally, we support our theoretical guarantees with numerical experiments.
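To make the interaction protocol concrete, the sketch below simulates one possible instantiation of the repeated game with a myopic, reward-maximizing agent and a deliberately naive incentive policy. The arm-reward model, the fixed transfer size, and the epsilon-style exploration rule are illustrative assumptions only; they are not the near-optimal algorithms analyzed in the paper.

```python
import numpy as np

# Minimal sketch of a repeated principal-agent bandit game (illustrative only).
# The greedy-agent model, reward distributions, and naive incentive policy are
# assumptions for illustration, not the paper's near-optimal algorithms.

rng = np.random.default_rng(0)
K, T = 5, 10_000                      # number of arms, horizon

theta = rng.uniform(0, 1, K)          # principal's mean reward per arm (unknown to her)
s = rng.uniform(0, 1, K)              # agent's mean reward per arm (unknown to the principal)

counts = np.zeros(K)                  # times each arm was played
mu_hat = np.zeros(K)                  # principal's empirical mean reward per arm
total_utility = 0.0

for t in range(T):
    # Naive incentive policy: occasionally steer a random arm for exploration;
    # otherwise incentivize the empirically best arm for the principal.
    if rng.random() < 0.05:
        target = int(rng.integers(K))
    else:
        target = int(np.argmax(mu_hat))
    incentives = np.zeros(K)
    incentives[target] = 1.0          # crude fixed transfer; the paper learns minimal incentives

    # Myopic agent: picks the arm maximizing his own reward plus the offered incentive.
    arm = int(np.argmax(s + incentives))

    # Principal observes a noisy reward for the chosen arm and pays the incentive.
    reward = theta[arm] + 0.1 * rng.standard_normal()
    total_utility += reward - incentives[arm]

    # Update the principal's empirical estimate for the played arm.
    counts[arm] += 1
    mu_hat[arm] += (reward - mu_hat[arm]) / counts[arm]

print(f"average principal utility: {total_utility / T:.3f}")
```

In this toy loop the principal's regret comes both from incentivizing the wrong arm and from overpaying; the paper's algorithms control both sources simultaneously to obtain nearly optimal dependence on T.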


Published In

ICML'24: Proceedings of the 41st International Conference on Machine Learning
July 2024
63010 pages

Publisher

JMLR.org

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Acceptance Rates

Overall Acceptance Rate 140 of 548 submissions, 26%
