
Bandits with Knapsacks

Published: 01 March 2018

Abstract

Multi-armed bandit problems are the predominant theoretical model of exploration-exploitation tradeoffs in learning, and they have countless applications ranging from medical trials, to communication networks, to Web search and advertising. In many of these application domains, the learner may be constrained by one or more supply (or budget) limits, in addition to the customary limitation on the time horizon. The literature lacks a general model encompassing these sorts of problems. We introduce such a model, called bandits with knapsacks, that combines bandit learning with aspects of stochastic integer programming. In particular, a bandit algorithm needs to solve a stochastic version of the well-known knapsack problem, which is concerned with packing items into a limited-size knapsack. A distinctive feature of our problem, in comparison to the existing regret-minimization literature, is that the optimal policy for a given latent distribution may significantly outperform the policy that plays the optimal fixed arm. Consequently, achieving sublinear regret in the bandits-with-knapsacks problem is significantly more challenging than in conventional bandit problems.
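For intuition, the following is a minimal Python simulation of the single-resource BwK protocol described above. It is illustrative only and not part of the paper: the Bernoulli rewards and costs, the default constants, and the hypothetical policy interface with choose/update hooks are all assumptions. Each round the learner pulls an arm, collects a stochastic reward, and consumes a stochastic amount of budget; the process stops once the time horizon ends or the budget can no longer cover a worst-case pull.

    import numpy as np

    def run_bwk(policy, T=10_000, budget=500, n_arms=3, seed=0):
        """Simulate a single-resource BwK instance (illustrative sketch)."""
        rng = np.random.default_rng(seed)
        mean_reward = rng.uniform(0.2, 0.9, n_arms)  # latent, unknown to the learner
        mean_cost = rng.uniform(0.1, 0.5, n_arms)    # latent consumption rates

        total_reward, remaining = 0.0, float(budget)
        for t in range(T):
            if remaining < 1.0:  # the next pull could overdraw the budget: stop
                break
            arm = policy.choose(t)
            reward = float(rng.random() < mean_reward[arm])  # Bernoulli reward
            cost = float(rng.random() < mean_cost[arm])      # Bernoulli consumption
            total_reward += reward
            remaining -= cost
            policy.update(arm, reward, cost)
        return total_reward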
We present two algorithms whose reward is close to the information-theoretic optimum: one is based on a novel “balanced exploration” paradigm, while the other is a primal-dual algorithm that uses multiplicative updates. Further, we prove that the regret achieved by both algorithms is optimal up to polylogarithmic factors. We illustrate the generality of the problem by presenting applications in a number of different domains, including electronic commerce, routing, and scheduling. As one example of a concrete application, we consider the problem of dynamic posted pricing with limited supply and obtain the first algorithm whose regret, with respect to the optimal dynamic policy, is sublinear in the supply.
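To make the primal-dual idea concrete, here is a hedged sketch of a policy in that spirit, compatible with the run_bwk simulator above. It is not the paper's algorithm verbatim: it keeps an optimistic (UCB) estimate of each arm's reward and a pessimistic (LCB) estimate of its cost, prices the resource with a dual variable updated multiplicatively, and greedily pulls the arm with the best estimated reward per unit of dual-priced cost. The confidence radius and the step size eps are illustrative choices, not the paper's tuned constants.

    import numpy as np

    class PrimalDualPolicy:
        """Primal-dual-flavored BwK policy (sketch, not the paper's exact algorithm)."""

        def __init__(self, n_arms, T, budget, eps=0.05):
            self.pulls = np.ones(n_arms)   # pull counts (start at 1 to avoid /0)
            self.rew = np.zeros(n_arms)    # cumulative observed rewards
            self.cost = np.zeros(n_arms)   # cumulative observed consumption
            self.T, self.budget, self.eps = T, budget, eps
            self.dual = 1.0                # multiplicatively updated resource price

        def choose(self, t):
            if t < len(self.pulls):        # initialization: pull each arm once
                return t
            rad = np.sqrt(2.0 * np.log(self.T) / self.pulls)  # confidence radius
            ucb_reward = np.minimum(self.rew / self.pulls + rad, 1.0)
            lcb_cost = np.maximum(self.cost / self.pulls - rad, 1e-6)
            # Greedy "bang per buck": optimistic reward per unit of priced cost.
            return int(np.argmax(ucb_reward / (self.dual * lcb_cost)))

        def update(self, arm, reward, cost):
            self.pulls[arm] += 1
            self.rew[arm] += reward
            self.cost[arm] += cost
            # Raise the price when consumption exceeds the sustainable
            # per-round rate budget/T, and lower it otherwise.
            self.dual *= (1.0 + self.eps) ** (cost - self.budget / self.T)

Under these assumptions, run_bwk(PrimalDualPolicy(n_arms=3, T=10_000, budget=500)) runs one simulated instance; any other policy, such as one based on the balanced-exploration paradigm, would plug into the same hypothetical interface.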


Published In

Journal of the ACM, Volume 65, Issue 3
June 2018, 285 pages
ISSN: 0004-5411
EISSN: 1557-735X
DOI: 10.1145/3191817
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 March 2018
Accepted: 01 November 2017
Revised: 01 September 2017
Received: 01 July 2015
Published in JACM Volume 65, Issue 3


Author Tags

  1. Multi-armed bandits
  2. dynamic pricing
  3. knapsack constraints

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • NSF
  • Microsoft Research
  • Google


Article Metrics

  • Downloads (last 12 months): 700
  • Downloads (last 6 weeks): 104
Reflects downloads up to 15 Oct 2024


Cited By

  • (2024) Dynamic Matching with Post-Allocation Service and its Application to Refugee Resettlement. SSRN Electronic Journal. DOI: 10.2139/ssrn.4748762.
  • (2024) Static Pricing for Multi-unit Prophet Inequalities. Operations Research 72(4), 1388-1399. DOI: 10.1287/opre.2023.0031.
  • (2024) Online Learning and Pricing for Service Systems with Reusable Resources. Operations Research 72(3), 1203-1241. DOI: 10.1287/opre.2022.2381.
  • (2024) Stateful Posted Pricing with Vanishing Regret via Dynamic Deterministic Markov Decision Processes. Mathematics of Operations Research 49(2), 880-900. DOI: 10.1287/moor.2023.1375.
  • (2024) Bilateral Trade. Mathematics of Operations Research 49(1), 171-203. DOI: 10.1287/moor.2023.1351.
  • (2024) The Online Saddle Point Problem and Online Convex Optimization with Knapsacks. Mathematics of Operations Research. DOI: 10.1287/moor.2018.0330.
  • (2024) Network Revenue Management With Demand Learning and Fair Resource-Consumption Balancing. Production and Operations Management 33(2), 494-511. DOI: 10.1177/10591478231225176.
  • (2024) Learning to Bid the Interest Rate in Online Unsecured Personal Loans. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 5150-5160. DOI: 10.1145/3637528.3671584.
  • (2024) No-Regret Learning in Bilateral Trade via Global Budget Balance. Proceedings of the 56th Annual ACM Symposium on Theory of Computing, 247-258. DOI: 10.1145/3618260.3649653.
  • (2024) AoI-Guaranteed Bandit: Information Gathering Over Unreliable Channels. IEEE Transactions on Mobile Computing 23(10), 9469-9486. DOI: 10.1109/TMC.2024.3367869.
