
Bandits with Knapsacks

Published: 01 March 2018

Abstract

Multi-armed bandit problems are the predominant theoretical model of exploration-exploitation tradeoffs in learning, and they have countless applications ranging from medical trials, to communication networks, to Web search and advertising. In many of these application domains, the learner may be constrained by one or more supply (or budget) limits, in addition to the customary limitation on the time horizon. The literature lacks a general model encompassing these sorts of problems. We introduce such a model, called bandits with knapsacks, that combines bandit learning with aspects of stochastic integer programming. In particular, a bandit algorithm needs to solve a stochastic version of the well-known knapsack problem, which is concerned with packing items into a limited-size knapsack. A distinctive feature of our problem, in comparison to the existing regret-minimization literature, is that the optimal policy for a given latent distribution may significantly outperform the policy that plays the optimal fixed arm. Consequently, achieving sublinear regret in the bandits-with-knapsacks problem is significantly more challenging than in conventional bandit problems.
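For intuition, the following is a minimal Python simulation of the single-resource BwK protocol described above. It is illustrative only and not part of the paper: the Bernoulli rewards and costs, the default constants, and the hypothetical policy interface with choose/update hooks are all assumptions. Each round the learner pulls an arm, collects a stochastic reward, and consumes a stochastic amount of budget; the process stops once the time horizon ends or the budget can no longer cover a worst-case pull.

    import numpy as np

    def run_bwk(policy, T=10_000, budget=500, n_arms=3, seed=0):
        """Simulate a single-resource BwK instance (illustrative sketch)."""
        rng = np.random.default_rng(seed)
        mean_reward = rng.uniform(0.2, 0.9, n_arms)  # latent, unknown to the learner
        mean_cost = rng.uniform(0.1, 0.5, n_arms)    # latent consumption rates

        total_reward, remaining = 0.0, float(budget)
        for t in range(T):
            if remaining < 1.0:  # the next pull could overdraw the budget: stop
                break
            arm = policy.choose(t)
            reward = float(rng.random() < mean_reward[arm])  # Bernoulli reward
            cost = float(rng.random() < mean_cost[arm])      # Bernoulli consumption
            total_reward += reward
            remaining -= cost
            policy.update(arm, reward, cost)
        return total_reward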
We present two algorithms whose reward is close to the information-theoretic optimum: one is based on a novel “balanced exploration” paradigm, while the other is a primal-dual algorithm that uses multiplicative updates. Further, we prove that the regret achieved by both algorithms is optimal up to polylogarithmic factors. We illustrate the generality of the problem by presenting applications in a number of different domains, including electronic commerce, routing, and scheduling. As one example of a concrete application, we consider the problem of dynamic posted pricing with limited supply and obtain the first algorithm whose regret, with respect to the optimal dynamic policy, is sublinear in the supply.
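To make the primal-dual idea concrete, here is a hedged sketch of a policy in that spirit, compatible with the run_bwk simulator above. It is not the paper's algorithm verbatim: it keeps an optimistic (UCB) estimate of each arm's reward and a pessimistic (LCB) estimate of its cost, prices the resource with a dual variable updated multiplicatively, and greedily pulls the arm with the best estimated reward per unit of dual-priced cost. The confidence radius and the step size eps are illustrative choices, not the paper's tuned constants.

    import numpy as np

    class PrimalDualPolicy:
        """Primal-dual-flavored BwK policy (sketch, not the paper's exact algorithm)."""

        def __init__(self, n_arms, T, budget, eps=0.05):
            self.pulls = np.ones(n_arms)   # pull counts (start at 1 to avoid /0)
            self.rew = np.zeros(n_arms)    # cumulative observed rewards
            self.cost = np.zeros(n_arms)   # cumulative observed consumption
            self.T, self.budget, self.eps = T, budget, eps
            self.dual = 1.0                # multiplicatively updated resource price

        def choose(self, t):
            if t < len(self.pulls):        # initialization: pull each arm once
                return t
            rad = np.sqrt(2.0 * np.log(self.T) / self.pulls)  # confidence radius
            ucb_reward = np.minimum(self.rew / self.pulls + rad, 1.0)
            lcb_cost = np.maximum(self.cost / self.pulls - rad, 1e-6)
            # Greedy "bang per buck": optimistic reward per unit of priced cost.
            return int(np.argmax(ucb_reward / (self.dual * lcb_cost)))

        def update(self, arm, reward, cost):
            self.pulls[arm] += 1
            self.rew[arm] += reward
            self.cost[arm] += cost
            # Raise the price when consumption exceeds the sustainable
            # per-round rate budget/T, and lower it otherwise.
            self.dual *= (1.0 + self.eps) ** (cost - self.budget / self.T)

Under these assumptions, run_bwk(PrimalDualPolicy(n_arms=3, T=10_000, budget=500)) runs one simulated instance; any other policy, such as one based on the balanced-exploration paradigm, would plug into the same hypothetical interface.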


Published In

Journal of the ACM, Volume 65, Issue 3
June 2018, 285 pages
ISSN: 0004-5411
EISSN: 1557-735X
DOI: 10.1145/3191817
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 March 2018
Accepted: 01 November 2017
Revised: 01 September 2017
Received: 01 July 2015
Published in JACM Volume 65, Issue 3


Author Tags

  1. Multi-armed bandits
  2. dynamic pricing
  3. knapsack constraints

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • NSF
  • Microsoft Research
  • Google


Article Metrics

  • Downloads (last 12 months): 700
  • Downloads (last 6 weeks): 104
Reflects downloads up to 15 Oct 2024


Cited By

  • (2024) Dynamic Matching with Post-Allocation Service and its Application to Refugee Resettlement. SSRN Electronic Journal. DOI: 10.2139/ssrn.4748762.
  • (2024) Static Pricing for Multi-unit Prophet Inequalities. Operations Research 72(4), 1388-1399. DOI: 10.1287/opre.2023.0031.
  • (2024) Online Learning and Pricing for Service Systems with Reusable Resources. Operations Research 72(3), 1203-1241. DOI: 10.1287/opre.2022.2381.
  • (2024) Stateful Posted Pricing with Vanishing Regret via Dynamic Deterministic Markov Decision Processes. Mathematics of Operations Research 49(2), 880-900. DOI: 10.1287/moor.2023.1375.
  • (2024) Bilateral Trade. Mathematics of Operations Research 49(1), 171-203. DOI: 10.1287/moor.2023.1351.
  • (2024) The Online Saddle Point Problem and Online Convex Optimization with Knapsacks. Mathematics of Operations Research. DOI: 10.1287/moor.2018.0330.
  • (2024) Network Revenue Management With Demand Learning and Fair Resource-Consumption Balancing. Production and Operations Management 33(2), 494-511. DOI: 10.1177/10591478231225176.
  • (2024) Learning to Bid the Interest Rate in Online Unsecured Personal Loans. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 5150-5160. DOI: 10.1145/3637528.3671584.
  • (2024) No-Regret Learning in Bilateral Trade via Global Budget Balance. Proceedings of the 56th Annual ACM Symposium on Theory of Computing, 247-258. DOI: 10.1145/3618260.3649653.
  • (2024) AoI-Guaranteed Bandit: Information Gathering Over Unreliable Channels. IEEE Transactions on Mobile Computing 23(10), 9469-9486. DOI: 10.1109/TMC.2024.3367869.
