Abstract
One challenge in applying reinforcement learning with nonlinear function approximators to high-dimensional continuous control problems is that the policy updates produced by many existing algorithms may fail to improve policy performance, or may even cause a serious degradation of it. To address this challenge, this paper proposes a new lower bound on the policy improvement in which an average policy divergence over the state space is penalized. To the best of our knowledge, this is currently the best available lower bound on the policy improvement. Optimizing the lower bound directly is difficult, because it demands high computational overhead. Following the idea of trust region policy optimization (TRPO), this paper therefore also presents a monotonic policy optimization algorithm based on the new lower bound; it generates a sequence of monotonically improving policies and is suitable for large-scale continuous control problems. The proposed algorithm is evaluated and compared with several existing algorithms on highly challenging robot locomotion tasks.
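For orientation, the classical bound of this family (Kakade and Langford 2002; Schulman et al. 2015a) is reproduced below; it penalizes the maximum KL divergence over states, whereas the bound proposed here penalizes an average divergence with a different constant, so this display is background rather than the paper's Theorem 1:
\[
\eta(\pi') \;\ge\; L_{\pi}(\pi') - \frac{4\epsilon\gamma}{(1-\gamma)^{2}}\,\max_{s} D_{\mathrm{KL}}\bigl(\pi(\cdot\mid s)\,\|\,\pi'(\cdot\mid s)\bigr),
\qquad \epsilon = \max_{s,a}\bigl|A_{\pi}(s,a)\bigr|,
\]
where \( L_{\pi}(\pi') = \eta(\pi) + \sum_{s}\rho_{\pi}(s)\sum_{a}\pi'(a\mid s)A_{\pi}(s,a) \) is the surrogate objective built from the discounted state visitation frequencies \( \rho_{\pi} \).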
References
Achiam J (2016) Easy monotonic policy iteration. arXiv:1602.09118
Duan Y, Chen X, Houthooft R, et al (2016) Benchmarking deep reinforcement learning for continuous control. Proceedings of the 33rd International Conference on Machine Learning, p 1329–1338
Haviv M, Van der Heyden L (1984) Perturbation bounds for the stationary probabilities of a finite Markov chain. Adv Appl Probab 16(4):804–818. http://www.jstor.org/stable/142734
Kakade S (2001) A natural policy gradient. In: NIPS, vol 14, p 1531–1538
Kakade S, Langford J (2002) Approximately optimal approximate reinforcement learning. Proceedings of the Nineteenth International Conference on Machine Learning. Morgan Kaufmann, p 267–274
Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet classification with deep convolutional neural networks. In: Advances in neural information processing systems, p 1097–1105
Martens J, Grosse R (2015) Optimizing neural networks with Kronecker-factored approximate curvature. Proceedings of the 32nd International Conference on Machine Learning, p 2408–2417
Martens J, Sutskever I (2012) Training deep and recurrent networks with Hessian-free optimization. In: Neural networks: tricks of the trade. Springer, p 479–535
Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. Nature 518(7540):529–533
Mnih V, Badia AP, Mirza M, Graves A, Lillicrap T, Harley T, Silver D, Kavukcuoglu K (2016) Asynchronous methods for deep reinforcement learning. Proceedings of the 33rd International Conference on Machine Learning, p 1928–1937
Ng AY, Harada D, Russell SJ (1999) Policy invariance under reward transformations: theory and application to reward shaping. Sixteenth International Conference on Machine Learning. Morgan Kaufmann, p 278–287
Peters J, Schaal S (2008) Natural actor-critic. Neurocomputing 71(7):1180–1190
Pirotta M, Restelli M, Pecorino A, et al (2013) Safe policy iteration. International Conference on Machine Learning, p 307–315
Schulman J, Levine S, Moritz P, et al (2015a) Trust region policy optimization. Proceedings of the 32nd International Conference on Machine Learning, p 1889–1897
Schulman J, Moritz P, Levine S, et al (2015b) High-dimensional continuous control using generalized advantage estimation. arXiv:1506.02438
Silver D, Lever G, Heess N, et al (2014) Deterministic policy gradient algorithms. International Conference on Machine Learning, p 387–395
Sutton R, Barto A (1998) Reinforcement learning: an introduction. MIT Press
Thomas P, Theocharous G, Ghavamzadeh M (2015) High confidence policy improvement. Proceedings of the 32nd International Conference on Machine Learning, p 2380–2388
Thomas PS, Theocharous G, Ghavamzadeh M (2015) High-confidence off-policy evaluation. In: AAAI, p 3000–3006
Todorov E, Erez T, Tassa Y (2012) MuJoCo: a physics engine for model-based control. Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, p 5026–5033
Van Hasselt H, Guez A, Silver D (2015) Deep reinforcement learning with double Q-learning. CoRR abs/1509.06461
Wang Z, de Freitas N, Lanctot M (2016) Dueling network architectures for deep reinforcement learning. Proceedings of the 33rd International Conference on Machine Learning, p 1995–2003
Williams RJ (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach Learn 8(3):229–256
Ye Y (2011) The simplex and policy-iteration methods are strongly polynomial for the Markov decision problem with a fixed discount rate. Math Oper Res 36(4):593–603
Acknowledgments
This research was funded by the National Natural Science Foundation (Project No. 61573145), the Public Research and Capacity Building of Guangdong Province (Project No. 2014B010104001), and the Basic and Applied Basic Research of Guangdong Province (Project No. 2015A030308018). The authors are grateful for these grants.
Appendix
A. Proof of Theorem 1
Proof: for convenience, define
where s' is the next state reached after the agent takes action a in state s. By Lemma 1, the following identity is obtained
\( \mathbb{E}_{s\sim d_{\mu}^{\pi'},\,a\sim \pi',\,s'\sim P}\bigl[\delta_f(s,a,s')\bigr] \) is rewritten in inner-product form as follows
Different inequalities applied to the inner product \( \bigl\langle d^{\pi'}-d^{\pi},\,\overline{\delta}_f^{\pi'}\bigr\rangle \), such as Hölder's inequality [1], lead to different lower bounds; here Lemma 3 is used. To our knowledge, the lower bound derived here is tighter than the existing ones. By importance sampling, the identity can be rewritten as follows
Bringing all the terms together, the lower bound on the policy improvement is obtained.
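To illustrate how a bound of this kind can drive a policy update, the following is a minimal sketch, not the authors' exact algorithm: a diagonal-Gaussian PyTorch policy is improved by a few gradient steps on an importance-sampled surrogate advantage penalized by the average KL divergence over sampled states, and the old parameters are restored whenever the penalized objective does not improve. The coefficient `penalty_coef`, the network sizes, and the use of Adam are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Diagonal-Gaussian policy with a small MLP mean network (illustrative)."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.mean = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                                  nn.Linear(64, act_dim))
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def dist(self, obs):
        return torch.distributions.Normal(self.mean(obs), self.log_std.exp())

def penalized_objective(policy, old_dist, obs, act, adv, penalty_coef):
    """Importance-sampled surrogate advantage minus an average-KL penalty."""
    new_dist = policy.dist(obs)
    # importance weights pi'(a|s) / pi(a|s), summed over action dimensions
    ratio = torch.exp(new_dist.log_prob(act).sum(-1) - old_dist.log_prob(act).sum(-1))
    surrogate = (ratio * adv).mean()
    avg_kl = torch.distributions.kl_divergence(old_dist, new_dist).sum(-1).mean()
    return surrogate - penalty_coef * avg_kl

def monotonic_update(policy, obs, act, adv, penalty_coef=1.0, lr=1e-2, steps=20):
    """A few gradient steps on the penalized surrogate; revert if it got worse."""
    with torch.no_grad():
        snapshot = policy.dist(obs)
        old_dist = torch.distributions.Normal(snapshot.loc.clone(), snapshot.scale.clone())
        old_params = [p.clone() for p in policy.parameters()]
        baseline = penalized_objective(policy, old_dist, obs, act, adv, penalty_coef)
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        (-penalized_objective(policy, old_dist, obs, act, adv, penalty_coef)).backward()
        opt.step()
    with torch.no_grad():
        # keep the new parameters only if the penalized surrogate did not decrease
        if penalized_objective(policy, old_dist, obs, act, adv, penalty_coef) < baseline:
            for p, p_old in zip(policy.parameters(), old_params):
                p.copy_(p_old)
```

In practice the advantages would come from an estimator such as generalized advantage estimation (Schulman et al. 2015b), and the penalty coefficient would be chosen in line with the constant appearing in the lower bound.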
B. Proof of Corollary 1
Proof: let
(the TD error); obviously,
by Theorem 1, Eq. (13) is obtained.
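For reference, the TD error has its standard form, with \( V^{\pi} \) denoting the state-value function of the current policy; the exact expression used in the corollary is an assumption here:
\[
\delta_{V}(s,a,s') = r(s,a) + \gamma V^{\pi}(s') - V^{\pi}(s).
\]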