Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning

Published: 01 December 2004

Abstract

    Policy gradient methods for reinforcement learning avoid some of the undesirable properties of the value function approaches, such as policy degradation (Baxter and Bartlett, 2001). However, the variance of the performance gradient estimates obtained from the simulation is sometimes excessive. In this paper, we consider variance reduction methods that were developed for Monte Carlo estimates of integrals. We study two commonly used policy gradient techniques, the baseline and actor-critic methods, from this perspective. Both can be interpreted as additive control variate variance reduction methods. We consider the expected average reward performance measure, and we focus on the GPOMDP algorithm for estimating performance gradients in partially observable Markov decision processes controlled by stochastic reactive policies. We give bounds for the estimation error of the gradient estimates for both baseline and actor-critic algorithms, in terms of the sample size and mixing properties of the controlled system. For the baseline technique, we compute the optimal baseline, and show that the popular approach of using the average reward to define the baseline can be suboptimal. For actor-critic algorithms, we show that using the true value function as the critic can be suboptimal. We also discuss algorithms for estimating the optimal baseline and approximate value function.
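
    To make the control variate interpretation above concrete, the sketch below estimates a policy gradient REINFORCE-style on a toy two-armed bandit and compares the per-sample variance of the estimate with no baseline and with an average-reward baseline. It is a minimal illustration under assumed details: the bandit rewards, the softmax policy, and the candidate baselines are not taken from the paper, and this is not the GPOMDP estimator itself.

        import numpy as np

        rng = np.random.default_rng(0)

        theta = np.zeros(2)                   # softmax policy parameters over 2 actions
        true_means = np.array([1.0, 0.2])     # expected reward of each action (toy problem)

        def softmax(x):
            z = np.exp(x - x.max())
            return z / z.sum()

        def grad_log_pi(theta, a):
            # d/dtheta log pi(a | theta) for a softmax policy: e_a - pi(. | theta)
            g = -softmax(theta)
            g[a] += 1.0
            return g

        def gradient_samples(theta, n, baseline=0.0):
            # Per-sample Monte Carlo terms of the performance gradient estimate:
            #   grad log pi(a_i | theta) * (r_i - baseline)
            # The baseline acts as an additive control variate: since
            # E[grad log pi] = 0, it leaves the mean unchanged but changes the variance.
            out = []
            for _ in range(n):
                a = rng.choice(2, p=softmax(theta))
                r = true_means[a] + rng.normal(scale=0.5)
                out.append(grad_log_pi(theta, a) * (r - baseline))
            return np.array(out)

        # Compare no baseline against the average-reward baseline
        # (the popular choice the abstract notes can be suboptimal).
        avg_reward = softmax(theta) @ true_means
        for b in (0.0, avg_reward):
            g = gradient_samples(theta, n=10000, baseline=b)
            print(f"baseline={b:.2f}  mean={g.mean(axis=0)}  var={g.var(axis=0)}")

    With theta = 0 the two actions are equally likely; because the rewards here are positive, subtracting the average reward typically lowers the per-coordinate variance of the samples, while the paper's point is that even this common baseline can be suboptimal relative to the variance-minimizing one.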

    References

    [1]
    D. Aberdeen. A survey of approximate methods for solving partially observable Markov decision processes. Technical report, Research School of Information Science and Engineering, Australian National University, Australia, 2002.
    [2]
    L. Baird. Gradient descent for general reinforcement learning. In S. A. Solla, M. S. Kearns, and D. A. Cohn, editors, Advances in Neural Information Processing Systems, volume 11. The MIT Press, 1999.
    [3]
    P. L. Bartlett and J. Baxter. Estimation and approximation bounds for gradient-based reinforcement learning. Journal of Computer and System Sciences, 64(1):133-150, February 2002.
    [4]
    A. G. Barto, R. S. Sutton, and C. W. Anderson. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 13:834-846, 1983.
    [5]
    J. Baxter and P. L. Bartlett. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15:319-350, 2001.
    [6]
    J. Baxter, P. L. Bartlett, and L. Weaver. Experiments with infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15:351-381, 2001.
    [7]
    S. J. Bradtke and A. Barto. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22(1-3):33-57, 1996.
    [8]
    P. Dayan. Reinforcement comparison. In Proceedings of the 1990 Connectionist Models Summer School, pages 45-51. Morgan Kaufmann, 1990.
    [9]
    J. L. Doob. Measure Theory. Number 143 in Graduate Texts in Mathematics. Springer-Verlag, New York, 1994.
    [10]
    M. Evans and T. Swartz. Approximating integrals via Monte Carlo and deterministic methods. Oxford statistical science series. Oxford University Press, Oxford; New York, 2000.
    [11]
    G. S. Fishman. Monte Carlo: Concepts, Algorithms and Applications. Springer series in operations research. Springer-Verlag, New York, 1996.
    [12]
    P. W. Glynn. Likelihood ratio gradient estimation for stochastic systems. Communications of the ACM, 33(10):75-84, 1990.
    [13]
    P. W. Glynn and P. L'Ecuyer. Likelihood ratio gradient estimation for regenerative stochastic recursions. Advances in Applied Probability, 27(4):1019-1053, 1995.
    [14]
    G. R. Grimmett and D. R. Stirzaker. Probability and Random Processes. Oxford University Press, Oxford, 1992.
    [15]
    J. M. Hammersley and D. C. Handscomb. Monte Carlo Methods. Chapman and Hall, New York, 1965.
    [16]
    T. Jaakkola, S. P. Singh, and M. I. Jordan. Reinforcement learning algorithm for partially observable Markov decision problems. In G. Tesauro, D. Touretzky, and T. Leen, editors, Advances in Neural Information Processing Systems, volume 7, pages 345-352. The MIT Press, 1995.
    [17]
    L. P. Kaelbling, M. L. Littman, and A. R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101:99-134, 1998.
    [18]
    H. Kimura and S. Kobayashi. An analysis of actor/critic algorithms using eligibility traces: Reinforcement learning with imperfect value functions. In International Conference on Machine Learning, pages 278-286, 1998a.
    [19]
    H. Kimura and S. Kobayashi. Reinforcement learning for continuous action using stochastic gradient ascent. In Intelligent Autonomous Systems, volume 5, pages 288-295, 1998b.
    [20]
    H. Kimura, K. Miyazaki, and S. Kobayashi. Reinforcement learning in POMDPs with function approximation. In D. H. Fisher, editor, International Conference on Machine Learning, pages 152-160, 1997.
    [21]
    H. Kimura, M. Yamamura, and S. Kobayashi. Reinforcement learning by stochastic hill climbing on discounted reward. In International Conference on Machine Learning, pages 295-303, 1995.
    [22]
    V. R. Konda and J. N. Tsitsiklis. Actor-critic algorithms. In T. K. Leen, S. A. Solla, and K. Müller, editors, Advances in Neural Information Processing Systems, volume 12. The MIT Press, 2000.
    [23]
    V. R. Konda and J. N. Tsitsiklis. On actor-critic algorithms. SIAM Journal on Control and Optimization, 42(4):1143-1166, 2003.
    [24]
    W. S. Lovejoy. A survey of algorithmic methods for partially observed Markov decision processes. Annals of Operations Research, 28:47-66, 1991.
    [25]
    P. Marbach and J. N. Tsitsiklis. Simulation-based optimization of Markov reward processes. IEEE Transactions on Automatic Control, 46(2):191-209, February 2001.
    [26]
    M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley series in probability and mathematical statistics. Applied probability and statistics. John Wiley & Sons, New York, 1994.
    [27]
    M. I. Reiman and A. Weiss. Sensitivity analysis for simulations via likelihood ratios. Operations Research, 37, 1989.
    [28]
    R. Y. Rubinstein. How to optimize complex stochastic systems from a single sample path by the score function method. Annals of Operations Research, 27:175-211, 1991.
    [29]
    E. Seneta. Non-negative Matrices and Markov Chains. Springer series in statistics. Springer-Verlag, New York, 1981.
    [30]
    S. P. Singh, T. Jaakkola, and M. I. Jordan. Learning without state-estimation in partially observable Markovian decision processes. In International Conference on Machine Learning, pages 284-292, 1994.
    [31]
    R. S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3:9-44, 1988.
    [32]
    R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
    [33]
    R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In T. K. Leen, S. A. Solla, and K. Müller, editors, Advances in Neural Information Processing Systems, volume 12. The MIT Press, 2000.
    [34]
    F. Topsøe. Some inequalities for information divergence and related measures of discrimination. IEEE Transactions on Information Theory, 46:1602-1609, 2000.
    [35]
    R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229-256, 1992.

    Published In

    The Journal of Machine Learning Research, Volume 5 (December 2004), 1571 pages
    ISSN: 1532-4435
    EISSN: 1533-7928

    Publisher

    JMLR.org
