Online Bellman residual and temporal difference algorithms with predictive error guarantees

Published: 09 July 2016

Abstract

We establish connections between minimizing the Bellman residual or the temporal difference loss and worst-case long-term predictive error. In the online learning framework, learning takes place over a sequence of trials with the goal of predicting a future discounted sum of rewards. Our first analysis shows that, under a stability assumption, any no-regret online learning algorithm that minimizes the Bellman error ensures small prediction error. Our second analysis shows that applying the family of online mirror descent algorithms to the temporal difference loss also ensures small prediction error. No statistical assumptions are made on the sequence of observations, which may be non-Markovian or even adversarial. Our approach thus establishes a broad new family of provably sound algorithms and generalizes previous worst-case results for minimizing predictive error. We investigate the potential advantages of some members of this family both theoretically and empirically on benchmark problems.
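
To make the setting concrete, here is a minimal sketch, not taken from the paper, of the two per-trial updates the abstract contrasts, using linear function approximation. All names, the step size, the discount factor, and the toy observation stream are illustrative assumptions; the paper's actual algorithms and guarantees are more general. The first update takes a full gradient step on the squared Bellman residual (residual-gradient style); the second takes a TD(0)-style semi-gradient step, which corresponds to the gradient-descent instance of online mirror descent (i.e., mirror descent with the Euclidean mirror map).

    import numpy as np

    GAMMA, ETA = 0.9, 0.05  # discount factor and step size (illustrative choices)

    def bellman_residual_step(w, x, r, x_next):
        # Full-gradient step on the per-trial squared Bellman residual
        #   l_t(w) = (w.x_t - r_t - GAMMA * w.x_{t+1})^2
        delta = w @ x - r - GAMMA * (w @ x_next)
        return w - ETA * 2.0 * delta * (x - GAMMA * x_next)

    def td_step(w, x, r, x_next):
        # TD(0)-style semi-gradient step: same residual, but the gradient is
        # taken only through the current prediction w.x_t, not through the
        # bootstrapped target GAMMA * w.x_{t+1}.
        delta = w @ x - r - GAMMA * (w @ x_next)
        return w - ETA * delta * x

    # Toy run on an arbitrary observation stream. Note the updates consume only
    # the observed (x_t, r_t, x_{t+1}) triples, so no Markov assumption is
    # needed to execute them.
    rng = np.random.default_rng(0)
    d, T = 4, 1000
    w_br, w_td = np.zeros(d), np.zeros(d)
    x = rng.standard_normal(d)
    for t in range(T):
        x_next = rng.standard_normal(d)
        r = x.sum() + 0.1 * rng.standard_normal()  # stream could be adversarial
        w_br = bellman_residual_step(w_br, x, r, x_next)
        w_td = td_step(w_td, x, r, x_next)
        x = x_next

The only difference between the two updates is whether the gradient flows through the bootstrapped target; the paper's two analyses cover these two regimes under different conditions (no-regret plus stability for the Bellman residual, online mirror descent for the temporal difference loss).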


Cited By

  • (2018) Continuous-time value function approximation in reproducing kernel Hilbert spaces. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 2818-2829. 10.5555/3327144.3327205. Online publication date: 3-Dec-2018.


Information

      Published In

      IJCAI'16: Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence
      July 2016
      4277 pages
      ISBN:9781577357704

      Sponsors

      • Sony Corporation
      • Arizona State University
      • Microsoft
      • Facebook
      • AI Journal

      Publisher

      AAAI Press
