Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning

Published: 01 December 2004

Abstract

    Policy gradient methods for reinforcement learning avoid some of the undesirable properties of the value function approaches, such as policy degradation (Baxter and Bartlett, 2001). However, the variance of the performance gradient estimates obtained from the simulation is sometimes excessive. In this paper, we consider variance reduction methods that were developed for Monte Carlo estimates of integrals. We study two commonly used policy gradient techniques, the baseline and actor-critic methods, from this perspective. Both can be interpreted as additive control variate variance reduction methods. We consider the expected average reward performance measure, and we focus on the GPOMDP algorithm for estimating performance gradients in partially observable Markov decision processes controlled by stochastic reactive policies. We give bounds for the estimation error of the gradient estimates for both baseline and actor-critic algorithms, in terms of the sample size and mixing properties of the controlled system. For the baseline technique, we compute the optimal baseline, and show that the popular approach of using the average reward to define the baseline can be suboptimal. For actor-critic algorithms, we show that using the true value function as the critic can be suboptimal. We also discuss algorithms for estimating the optimal baseline and approximate value function.
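
    To make the control variate interpretation above concrete, the sketch below estimates a policy gradient REINFORCE-style on a toy two-armed bandit and compares the per-sample variance of the estimate with no baseline and with an average-reward baseline. It is a minimal illustration under assumed details: the bandit rewards, the softmax policy, and the candidate baselines are not taken from the paper, and this is not the GPOMDP estimator itself.

        import numpy as np

        rng = np.random.default_rng(0)

        theta = np.zeros(2)                   # softmax policy parameters over 2 actions
        true_means = np.array([1.0, 0.2])     # expected reward of each action (toy problem)

        def softmax(x):
            z = np.exp(x - x.max())
            return z / z.sum()

        def grad_log_pi(theta, a):
            # d/dtheta log pi(a | theta) for a softmax policy: e_a - pi(. | theta)
            g = -softmax(theta)
            g[a] += 1.0
            return g

        def gradient_samples(theta, n, baseline=0.0):
            # Per-sample Monte Carlo terms of the performance gradient estimate:
            #   grad log pi(a_i | theta) * (r_i - baseline)
            # The baseline acts as an additive control variate: since
            # E[grad log pi] = 0, it leaves the mean unchanged but changes the variance.
            out = []
            for _ in range(n):
                a = rng.choice(2, p=softmax(theta))
                r = true_means[a] + rng.normal(scale=0.5)
                out.append(grad_log_pi(theta, a) * (r - baseline))
            return np.array(out)

        # Compare no baseline against the average-reward baseline
        # (the popular choice the abstract notes can be suboptimal).
        avg_reward = softmax(theta) @ true_means
        for b in (0.0, avg_reward):
            g = gradient_samples(theta, n=10000, baseline=b)
            print(f"baseline={b:.2f}  mean={g.mean(axis=0)}  var={g.var(axis=0)}")

    With theta = 0 the two actions are equally likely; because the rewards here are positive, subtracting the average reward typically lowers the per-coordinate variance of the samples, while the paper's point is that even this common baseline can be suboptimal relative to the variance-minimizing one.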

    References

    [1]
    D. Aberdeen. A survey of approximate methods for solving partially observable Markov decision processes. Technical report, Research School of Information Science and Engineering, Australian National University, Australia, 2002.
    [2]
    L. Baird. Gradient descent for general reinforcement learning. In S. A. Solla, M. S. Kearns, and D. A. Cohn, editors, Advances in Neural Information Processing Systems, volume 11. The MIT Press, 1999.
    [3]
    P. L. Bartlett and J. Baxter. Estimation and approximation bounds for gradient-based reinforcement learning. Journal of Computer and System Sciences, 64(1):133-150, February 2002.
    [4]
    A. G. Barto, R. S. Sutton, and C. W. Anderson. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 13:834-846, 1983.
    [5]
    J. Baxter and P. L. Bartlett. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15:319-350, 2001.
    [6]
    J. Baxter, P. L. Bartlett, and L. Weaver. Experiments with infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15:351-381, 2001.
    [7]
    S. J. Bradtke and A. Barto. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22(1-3):33-57, 1996.
    [8]
    P. Dayan. Reinforcement comparison. In Proceedings of the 1990 Connectionist Models Summer School, pages 45-51. Morgan Kaufmann, 1990.
    [9]
    J. L. Doob. Measure Theory. Number 143 in Graduate Texts in Mathematics. Springer-Verlag, New York, 1994.
    [10]
    M. Evans and T. Swartz. Approximating integrals via Monte Carlo and deterministic methods. Oxford statistical science series. Oxford University Press, Oxford; New York, 2000.
    [11]
    G. S. Fishman. Monte Carlo: Concepts, Algorithms and Applications. Springer series in operations research. Springer-Verlag, New York, 1996.
    [12]
    P. W. Glynn. Likelihood ratio gradient estimation for stochastic systems. Communications of the ACM, 33(10):75-84, 1990.
    [13]
    P. W. Glynn and P. L'Ecuyer. Likelihood ratio gradient estimation for regenerative stochastic recursions. Advances in Applied Probability, 27(4):1019-1053, 1995.
    [14]
    G. R. Grimmett and D. R. Stirzaker. Probability and Random Processes. Oxford University Press, Oxford, 1992.
    [15]
    J. M. Hammersley and D. C. Handscomb. Monte Carlo Methods. Chapman and Hall, New York, 1965.
    [16]
    T. Jaakkola, S. P. Singh, and M. I. Jordan. Reinforcement learning algorithm for partially observable Markov decision problems. In G. Tesauro, D. Touretzky, and T. Leen, editors, Advances in Neural Information Processing Systems, volume 7, pages 345-352. The MIT Press, 1995.
    [17]
    L. P. Kaelbling, M. L. Littman, and A. R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101:99-134, 1998.
    [18]
    H. Kimura and S. Kobayashi. An analysis of actor/critic algorithms using eligibility traces: Reinforcement learning with imperfect value functions. In International Conference on Machine Learning, pages 278-286, 1998a.
    [19]
    H. Kimura and S. Kobayashi. Reinforcement learning for continuous action using stochastic gradient ascent. In Intelligent Autonomous Systems, volume 5, pages 288-295, 1998b.
    [20]
    H. Kimura, K. Miyazaki, and S. Kobayashi. Reinforcement learning in POMDPs with function approximation. In D. H. Fisher, editor, International Conference on Machine Learning, pages 152-160, 1997.
    [21]
    H. Kimura, M. Yamamura, and S. Kobayashi. Reinforcement learning by stochastic hill climbing on discounted reward. In International Conference on Machine Learning, pages 295-303, 1995.
    [22]
    V. R. Konda and J. N. Tsitsiklis. Actor-critic algorithms. In T. K. Leen, S. A. Solla, and K. Müller, editors, Advances in Neural Information Processing Systems, volume 12. The MIT Press, 2000.
    [23]
    V. R. Konda and J. N. Tsitsiklis. On actor-critic algorithms. SIAM Journal on Control and Optimization, 42(4):1143-1166, 2003.
    [24]
    W. S. Lovejoy. A survey of algorithmic methods for partially observed Markov decision processes. Annals of Operations Research, 28:47-66, 1991.
    [25]
    P. Marbach and J. N. Tsitsiklis. Simulation-based optimization of Markov reward processes. IEEE Transactions on Automatic Control, 46(2):191-209, February 2001.
    [26]
    M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley series in probability and mathematical statistics. Applied probability and statistics. John Wiley & Sons, New York, 1994.
    [27]
    M. I. Reiman and A. Weiss. Sensitivity analysis for simulations via likelihood ratios. Operations Research, 37, 1989.
    [28]
    R. Y. Rubinstein. How to optimize complex stochastic systems from a single sample path by the score function method. Annals of Operations Research, 27:175-211, 1991.
    [29]
    E. Seneta. Non-negative Matrices and Markov Chains. Springer series in statistics. Springer-Verlag, New York, 1981.
    [30]
    S. P. Singh, T. Jaakkola, and M. I. Jordan. Learning without state-estimation in partially observable Markovian decision processes. In International Conference on Machine Learning, pages 284-292, 1994.
    [31]
    R. S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3:9-44, 1988.
    [32]
    R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
    [33]
    R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In T. K. Leen, S. A. Solla, and K. Müller, editors, Advances in Neural Information Processing Systems, volume 12. The MIT Press, 2000.
    [34]
    F. Topsøe. Some inequalities for information divergence and related measures of discrimination. IEEE Transactions on Information Theory, 46:1602-1609, 2000.
    [35]
    R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229-256, 1992.

    Published In

    The Journal of Machine Learning Research, Volume 5 (December 2004), 1571 pages
    ISSN: 1532-4435
    EISSN: 1533-7928

    Publisher

    JMLR.org
