DOI: 10.1145/2020408.2020445
Research article

Learning to trade off between exploration and exploitation in multiclass bandit prediction

Published: 21 August 2011

Abstract

We study multi-class bandit prediction, an online learning problem in which the learner receives only partial feedback in each trial, indicating whether the predicted class label is correct. Trading off exploration against exploitation is a well-known technique for online learning with incomplete feedback (i.e., the bandit setup). Banditron [8], a multi-class online learning algorithm for the bandit setting, maximizes run-time gain by balancing exploration and exploitation with a fixed tradeoff parameter. The performance of Banditron can be quite sensitive to the choice of this parameter, so effective algorithms that tune it automatically are desirable. In this paper, we propose three learning strategies that automatically adjust the tradeoff parameter for Banditron. An extensive empirical study on multiple real-world data sets verifies the efficacy of the proposed approaches in learning the exploration vs. exploitation tradeoff parameter.
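For context, one round of the Banditron algorithm [8] that the paper builds on can be sketched as follows. This is an illustrative sketch, not the authors' code: `gamma` plays the role of the fixed exploration vs. exploitation tradeoff parameter that the paper proposes to tune automatically, and the update follows the importance-weighted Perceptron-style rule of Kakade et al. [8].

```python
import numpy as np

def banditron_step(W, x, y_true, gamma, rng):
    """One round of Banditron with fixed exploration rate gamma.

    W      : (K, d) weight matrix, one row per class.
    x      : d-dimensional feature vector.
    y_true : hidden correct label; the learner only observes whether
             its guess was correct (bandit feedback).
    """
    K = W.shape[0]
    scores = W @ x
    y_hat = int(np.argmax(scores))  # exploitation (greedy) choice

    # Exploration/exploitation mixture: predict y_hat with probability
    # (1 - gamma), otherwise a uniformly random label.
    probs = np.full(K, gamma / K)
    probs[y_hat] += 1.0 - gamma
    y_pred = int(rng.choice(K, p=probs))

    # Bandit feedback: only the correctness of y_pred is revealed.
    correct = (y_pred == y_true)

    # Importance-weighted unbiased update: reward the sampled label
    # (scaled by 1/probability) if correct, penalize the greedy label.
    U = np.zeros_like(W)
    if correct:
        U[y_pred] += x / probs[y_pred]
    U[y_hat] -= x
    return W + U, y_pred, correct
```

With `gamma = 0` the algorithm is purely greedy; larger values spend more rounds exploring. The paper's contribution is to adapt this single parameter online rather than fixing it in advance.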

References

[1]
Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2--3):235--256, 2002.
[2]
Alina Beygelzimer, John Langford, Lihong Li, Lev Reyzin, and Robert E. Schapire. An optimal high probability algorithm for the contextual bandit problem. CoRR, abs/1002.4058, 2010.
[3]
Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
[4]
Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines, 2001.
[5]
Eyal Even-Dar, Shie Mannor, and Yishay Mansour. PAC bounds for multi-armed bandit and Markov decision processes. In COLT '02: Proceedings of the 15th Annual Conference on Computational Learning Theory, pages 255--270, 2002.
[6]
Eyal Even-Dar, Shie Mannor, and Yishay Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7:1079--1105, 2006.
[7]
A. Frank and A. Asuncion. UCI machine learning repository, 2010.
[8]
Sham M. Kakade, Shai Shalev-Shwartz, and Ambuj Tewari. Efficient bandit algorithms for online multiclass prediction. In ICML 2008: Proceedings of the 25th International Conference on Machine Learning, pages 440--447, 2008.
[9]
John Langford and Tong Zhang. The epoch-greedy algorithm for contextual multi-armed bandits. In NIPS 2007: Proceedings of the 20th Annual Conference on Neural Information Processing Systems, 2007.
[10]
D. D. Lewis, Y. Yang, T. Rose, and F. Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361--397, 2004.
[11]
Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. In WWW '10: Proceedings of the 19th International Conference on World Wide Web, pages 661--670, New York, NY, USA, 2010. ACM.
[12]
Wei Li, Xuerui Wang, Ruofei Zhang, Ying Cui, Jianchang Mao, and Rong Jin. Exploitation and exploration in a performance based contextual advertising system. In KDD 2010: Knowledge Discovery and Data Mining, pages 27--36, 2010.
[13]
Shie Mannor and John N. Tsitsiklis. The sample complexity of exploration in the multi-armed bandit problem. Journal of Machine Learning Research, 5:623--648, 2004.
[14]
Herbert Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58:527--535, 1952.
[15]
Herbert Robbins. Some aspects of the sequential design of experiments. Bull. Amer. Math. Soc., 58(5):527--535, 1952.
[16]
F. Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65:386--408, 1958.
[17]
Joannès Vermorel and Mehryar Mohri. Multi-armed bandit algorithms and empirical evaluation. In European Conference on Machine Learning, pages 437--448. Springer, 2005.
[18]
Shijun Wang, Rong Jin, and Hamed Valizadegan. A potential-based framework for online multi-class learning with partial feedback. In AISTATS 2010: Artificial Intelligence and Statistics, 2010.
[19]
C. Watkins. Learning from Delayed Rewards. PhD thesis, University of Cambridge, 1989.



Published In

KDD '11: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
August 2011, 1446 pages
ISBN: 9781450308137
DOI: 10.1145/2020408

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. bandit feedback
  2. exploration vs. exploitation
  3. multi-class classification
  4. online learning


Acceptance Rates

Overall acceptance rate: 1,133 of 8,635 submissions, 13%


Cited By

  • (2021) Multi-Step Look-Ahead Optimization Methods for Dynamic Pricing With Demand Learning. IEEE Access, 9:88478--88497. DOI: 10.1109/ACCESS.2021.3087577
  • (2021) Novel pricing strategies for revenue maximization and demand learning using an exploration--exploitation framework. Soft Computing, 25(17):11711--11733. DOI: 10.1007/s00500-021-06047-y
  • (2018) Word sense disambiguation using hybrid swarm intelligence approach. PLOS ONE, 13(12):e0208695. DOI: 10.1371/journal.pone.0208695
  • (2012) Incremental learning using partial feedback for gesture-based human-swarm interaction. 2012 IEEE RO-MAN: The 21st IEEE International Symposium on Robot and Human Interactive Communication, pages 898--905. DOI: 10.1109/ROMAN.2012.6343865
  • (2011) Applying Multiclass Bandit algorithms to call-type classification. 2011 IEEE Workshop on Automatic Speech Recognition & Understanding, pages 431--436. DOI: 10.1109/ASRU.2011.6163970
