DOI: 10.5555/3692070.3693846

Incentivized learning in principal-agent bandit games

Published: 21 July 2024

Abstract

This work considers a repeated principal-agent bandit game in which the principal can interact with her environment only through the agent. The principal and the agent have misaligned objectives, and the choice of action is left to the agent alone. However, the principal can influence the agent's decisions by offering incentives that are added to his rewards. The principal aims to iteratively learn an incentive policy that maximizes her own total utility. This framework extends the usual bandit problem and is motivated by several practical applications, such as healthcare or ecological taxation, where traditional mechanism design theories often overlook the learning aspect of the problem. We present nearly optimal (with respect to the horizon T) learning algorithms for the principal's regret in both the multi-armed and the linear contextual settings. Finally, we support our theoretical guarantees with numerical experiments.
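To make the interaction protocol concrete, the sketch below simulates one possible instantiation of the repeated game with a myopic, reward-maximizing agent and a deliberately naive incentive policy. The arm-reward model, the fixed transfer size, and the epsilon-style exploration rule are illustrative assumptions only; they are not the near-optimal algorithms analyzed in the paper.

```python
import numpy as np

# Minimal sketch of a repeated principal-agent bandit game (illustrative only).
# The greedy-agent model, reward distributions, and naive incentive policy are
# assumptions for illustration, not the paper's near-optimal algorithms.

rng = np.random.default_rng(0)
K, T = 5, 10_000                      # number of arms, horizon

theta = rng.uniform(0, 1, K)          # principal's mean reward per arm (unknown to her)
s = rng.uniform(0, 1, K)              # agent's mean reward per arm (unknown to the principal)

counts = np.zeros(K)                  # times each arm was played
mu_hat = np.zeros(K)                  # principal's empirical mean reward per arm
total_utility = 0.0

for t in range(T):
    # Naive incentive policy: occasionally steer a random arm for exploration;
    # otherwise incentivize the empirically best arm for the principal.
    if rng.random() < 0.05:
        target = int(rng.integers(K))
    else:
        target = int(np.argmax(mu_hat))
    incentives = np.zeros(K)
    incentives[target] = 1.0          # crude fixed transfer; the paper learns minimal incentives

    # Myopic agent: picks the arm maximizing his own reward plus the offered incentive.
    arm = int(np.argmax(s + incentives))

    # Principal observes a noisy reward for the chosen arm and pays the incentive.
    reward = theta[arm] + 0.1 * rng.standard_normal()
    total_utility += reward - incentives[arm]

    # Update the principal's empirical estimate for the played arm.
    counts[arm] += 1
    mu_hat[arm] += (reward - mu_hat[arm]) / counts[arm]

print(f"average principal utility: {total_utility / T:.3f}")
```

In this toy loop the principal's regret comes both from incentivizing the wrong arm and from overpaying; the paper's algorithms control both sources simultaneously to obtain nearly optimal dependence on T.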


Published In

ICML'24: Proceedings of the 41st International Conference on Machine Learning
July 2024
63010 pages

Publisher

JMLR.org

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Acceptance Rates

Overall Acceptance Rate 140 of 548 submissions, 26%
