
A monotonic policy optimization algorithm for high-dimensional continuous control problem in 3D MuJoCo

Multimedia Tools and Applications


One challenge in applying reinforcement learning with nonlinear function approximators to high-dimensional continuous control problems is that the policy update produced by many existing algorithms may fail to improve policy performance, or may even seriously degrade it. To address this challenge, this paper proposes a new lower bound on the policy improvement in which an average policy divergence over the state space is penalized. To the best of our knowledge, this is currently the best result on the lower bound of the policy improvement. Directly optimizing this lower bound is very difficult because of its high computational overhead. Following the idea of trust region policy optimization (TRPO), this paper also presents a monotonic policy optimization algorithm based on the new lower bound. The algorithm can generate a sequence of monotonically improving policies and is suitable for large-scale continuous control problems. Finally, the paper evaluates the proposed algorithms and compares them with several existing algorithms on highly challenging robot locomotion tasks.

References


  1. Achiam J (2016) Easy monotonic policy iteration [J]. arXiv:1602.09118

  2. Duan Y, Chen X, Houthooft R, et al. (2016) Benchmarking deep reinforcement learning for continuous control [J]. Proceedings of The 33rd International Conference on Machine Learning, p 1329–1338

  3. Haviv M, Van Der Heyden L (1984) Perturbation bounds for the stationary probabilities of a finite Markov chain. Adv Appl Probab 16(4):804–818. ISSN 00018678. URL http://www.jstor.org/stable/142734

  4. Kakade S (2001a) A natural policy gradient. In: NIPS, volume 14, p 1531–1538

  5. Kakade S, Langford J (2002) Approximately optimal approximate reinforcement learning. Nineteenth International Conference on Machine Learning. Morgan Kaufmann Publishers Inc., p 267–274

  6. Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet classification with deep convolutional neural networks. In: Advances in neural information processing systems. p 1097–1105

  7. Martens J, Grosse R (2015) Optimizing Neural Networks with Kronecker-factored Approximate Curvature [J]. Proceedings of The 32nd International Conference on Machine Learning, p 2408–2417

  8. Martens J, Sutskever I (2012) Training deep and recurrent networks with Hessian-free optimization. In: Neural networks: tricks of the trade. Springer, p 479–535

  9. Mnih V, Kavukcuoglu K, Silver D et al (2015) Human-level control through deep reinforcement learning [J]. Nature 518(7540):529–533

  10. Mnih V, Badia AP, Mirza M, Graves A, Lillicrap T, Harley T, Silver D, Kavukcuoglu K (2016) Asynchronous methods for deep reinforcement learning. Proceedings of The 33rd International Conference on Machine Learning, p 1928–1937

  11. Ng AY, Harada D, Russell SJ (1999) Policy Invariance Under Reward Transformations: Theory and Application to Reward Shaping [C]. Sixteenth International Conference on Machine Learning. Morgan Kaufmann Publishers Inc. p 278–287

  12. Peters J, Schaal S (2008) Natural actor-critic. Neurocomputing 71(7):1180–1190

  13. Pirotta M, Restelli M, Pecorino A, et al (2013) Safe policy iteration [C]. International Conference on Machine Learning. p 307–315

  14. Schulman J, Levine S, Moritz P, et al. (2015a) Trust region policy optimization [J]. Comput Sci:1889–1897

  15. Schulman J, Moritz P, Levine S, et al (2015b) High-dimensional continuous control using generalized advantage estimation [J]. arXiv preprint arXiv:1506.02438

  16. Silver D, Lever G, Heess N, et al (2014) Deterministic policy gradient algorithms [C]. International Conference on Machine Learning p 387–395

  17. Sutton R, Barto A (1998) Reinforcement learning: an introduction. MIT Press

  18. Thomas P, Theocharous G, Ghavamzadeh M (2015) High confidence policy improvement [J]. Proceedings of The 32nd International Conference on Machine Learning p 2380–2388

  19. Thomas PS, Theocharous G, Ghavamzadeh M (2015) High-confidence off-policy evaluation. In: AAAI, p 3000–3006

  20. Todorov E, Erez T, Tassa Y (2012) MuJoCo: a physics engine for model-based control [C]. Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems. p 5026–5033

  21. Van Hasselt H, Guez A, Silver D (2015) Deep reinforcement learning with double Q-learning [J]. CoRR, abs/1509.06461

  22. Wang Z, de Freitas N, Lanctot M (2016) Dueling network architectures for deep reinforcement learning. Proceedings of The 33rd International Conference on Machine Learning, p 1995–2003

  23. Williams RJ (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach Learn 8(3):229–256

  24. Ye Y (2011) The simplex and policy-iteration methods are strongly polynomial for the Markov decision problem with a fixed discount rate. Math Oper Res 36(4):593–603



This research is funded by the National Natural Science Foundation (Project No. 61573145), the Public Research and Capacity Building of Guangdong Province (Project No. 2014B010104001), and the Basic and Applied Basic Research of Guangdong Province (Project No. 2015A030308018). The authors are grateful for these grants.

Author information

Authors and Affiliations


Corresponding author

Correspondence to Qunyong Yuan.



A. Proof of Theorem 1

Proof: For convenience, define

$$ {\delta}_f\left(s,a,{s}^{\prime}\right)=R\left(s,a\right)+\gamma f\left({s}^{\prime}\right)-f(s),\kern0.5em {\overline{\delta}}_f^{\pi^{\prime }}(s)=\underset{a\sim {\pi}^{\prime },{s}^{\prime}\sim P}{E}\left[{\delta}_f\left(s,a,{s}^{\prime}\right)\mid s\right], $$

where s′ is the next state reached after the agent takes action a in state s. By Lemma 1, the following identity holds:

$$ {J}_{\mu}^{\pi^{\prime }}-{J}_{\mu}^{\pi }=\frac{1}{1-\gamma}\left(\underset{s\sim {d}_{\mu}^{\pi^{\prime }},a\sim {\pi}^{\prime },{s}^{\prime}\sim P}{E}\left[{\delta}_f\left(s,a,{s}^{\prime}\right)\right]-\underset{s\sim {d}_{\mu}^{\pi },a\sim \pi, {s}^{\prime}\sim P}{E}\left[{\delta}_f\left(s,a,{s}^{\prime}\right)\right]\right) $$
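The identity can be checked numerically. The following is a minimal sketch on a small tabular MDP, not the paper's experimental setup: the transition table `P`, rewards `R`, start distribution `mu`, both policies and the function `f` are made-up numbers. It also assumes that the visitation distribution is normalized as d(s) = (1 − γ) Σ_t γ^t Pr(s_t = s), which is what the 1/(1 − γ) factor suggests, and that δ_f has the TD-error form R(s,a) + γ f(s′) − f(s).

```python
# Exact check of the identity on a tiny tabular MDP (all numbers are made up).
GAMMA = 0.9
S, A = 3, 2
P = [[[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]],   # P[s][a][s2]
     [[0.3, 0.3, 0.4], [0.5, 0.1, 0.4]],
     [[0.2, 0.5, 0.3], [0.6, 0.2, 0.2]]]
R = [[1.0, 0.0], [0.5, 2.0], [-1.0, 0.3]]  # R[s][a]
mu = [0.5, 0.3, 0.2]                       # start-state distribution
pi_old = [[0.5, 0.5], [0.2, 0.8], [0.9, 0.1]]
pi_new = [[0.6, 0.4], [0.3, 0.7], [0.7, 0.3]]
f = [0.4, -1.3, 2.1]                       # arbitrary function of the state


def visitation(pi, steps=2000):
    """d(s) = (1 - gamma) * sum_t gamma^t Pr(s_t = s | mu, pi), truncated."""
    dist, d, w = mu[:], [0.0] * S, 1.0 - GAMMA
    for _ in range(steps):
        d = [d[s] + w * dist[s] for s in range(S)]
        dist = [sum(dist[s] * pi[s][a] * P[s][a][s2]
                    for s in range(S) for a in range(A)) for s2 in range(S)]
        w *= GAMMA
    return d


def mean_delta(pi, s):
    """E_{a~pi, s'~P}[R(s,a) + gamma f(s') - f(s)] for a fixed state s."""
    return sum(pi[s][a] * (R[s][a] - f[s] + GAMMA * sum(
        P[s][a][s2] * f[s2] for s2 in range(S))) for a in range(A))


def ret(pi, d):
    """Discounted return J = E_{s~d, a~pi}[R(s,a)] / (1 - gamma)."""
    return sum(d[s] * pi[s][a] * R[s][a]
               for s in range(S) for a in range(A)) / (1 - GAMMA)


d_old, d_new = visitation(pi_old), visitation(pi_new)
lhs = ret(pi_new, d_new) - ret(pi_old, d_old)
rhs = (sum(d_new[s] * mean_delta(pi_new, s) for s in range(S))
       - sum(d_old[s] * mean_delta(pi_old, s) for s in range(S))) / (1 - GAMMA)
print(abs(lhs - rhs) < 1e-9)
```

Because `f` is arbitrary here, the check illustrates that the identity does not depend on the choice of f; the terms involving f cancel through the start distribution.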

\( \underset{s\sim {d}_{\mu}^{\pi^{\prime }},a\sim {\pi}^{\prime },{s}^{\prime}\sim P}{E}\left[{\delta}_f\left(s,a,{s}^{\prime}\right)\right] \) can be rewritten in inner-product form as follows:

$$ {\displaystyle \begin{array}{l}\underset{s\sim {d}_{\mu}^{\pi^{\prime }},a\sim {\pi}^{\prime },{s}^{\prime}\sim P}{E}\left[{\delta}_f\left(s,a,{s}^{\prime}\right)\right]=\left\langle {d}_{\mu}^{\pi^{\prime }},{\overline{\delta}}_f^{\pi^{\prime }}\right\rangle =\left\langle {d}_{\mu}^{\pi },{\overline{\delta}}_f^{\pi^{\prime }}\right\rangle +\left\langle {d}_{\mu}^{\pi^{\prime }}-{d}_{\mu}^{\pi },{\overline{\delta}}_f^{\pi^{\prime }}\right\rangle \\ {}\ge \left\langle {d}_{\mu}^{\pi },{\overline{\delta}}_f^{\pi^{\prime }}\right\rangle -{\left\Vert {d}_{\mu}^{\pi^{\prime }}-{d}_{\mu}^{\pi}\right\Vert}_1\frac{\Delta {\overline{\delta}}_f^{\pi^{\prime }}}{2}\kern0.5em \left(\mathrm{by}\ \mathrm{Lemma}\ 3\right)\end{array}} $$
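The last inequality relies on the fact that the difference of two probability distributions sums to zero, so subtracting the constant (max g + min g)/2 from a function g leaves the inner product unchanged while shrinking the sup norm to half the span Δg = max g − min g. The sketch below checks this on random vectors and compares it with a Hölder-type bound that uses the (1, ∞) norm pairing. That pairing is an assumption about the alternative being compared, and the vectors are random stand-ins rather than quantities from the paper.

```python
import random

random.seed(0)


def random_dist(n):
    """A random probability vector of length n."""
    w = [random.random() + 1e-3 for _ in range(n)]
    total = sum(w)
    return [v / total for v in w]


def bounds(n=6):
    """Return (inner product, span-based bound, Hoelder-type bound)."""
    p, q = random_dist(n), random_dist(n)
    x = [a - b for a, b in zip(p, q)]            # entries sum to zero
    g = [random.uniform(-5.0, 5.0) for _ in range(n)]
    inner = sum(xi * gi for xi, gi in zip(x, g))
    l1 = sum(abs(xi) for xi in x)
    span_bound = -l1 * (max(g) - min(g)) / 2.0   # form used in the proof
    holder_bound = -l1 * max(abs(v) for v in g)  # (1, inf) Hoelder pairing
    return inner, span_bound, holder_bound


results = [bounds() for _ in range(1000)]
# Both are valid lower bounds; the span-based one is never the looser of the two.
ok = all(h <= s + 1e-12 and s <= i + 1e-12 for i, s, h in results)
print(ok)
```

The ordering holds in general because max g − min g ≤ 2 max|g|, with equality only when g is symmetric about zero.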

Different inequalities can be applied to the inner product \( \left\langle {d}_{\mu}^{\pi^{\prime }}-{d}_{\mu}^{\pi },{\overline{\delta}}_f^{\pi^{\prime }}\right\rangle \), such as Hölder’s inequality [1], and they may lead to different lower bounds. Here we use Lemma 3. To our knowledge, the resulting lower bound is tighter than the others. By importance sampling, the term \( \left\langle {d}_{\mu}^{\pi },{\overline{\delta}}_f^{\pi^{\prime }}\right\rangle \) can be rewritten as follows:

$$ \left\langle {d}_{\mu}^{\pi },{\overline{\delta}}_f^{\pi^{\prime }}\right\rangle =\underset{s\sim {d}_{\mu}^{\pi },a\sim {\pi}^{\prime },{s}^{\prime}\sim P}{E}\left[{\delta}_f\left(s,a,{s}^{\prime}\right)\right]=\underset{s\sim {d}_{\mu}^{\pi },a\sim \pi, {s}^{\prime}\sim P}{E}\left[\frac{\pi^{\prime}\left(a|s\right)}{\pi \left(a|s\right)}\left(R\left(s,a\right)+\gamma f\left({s}^{\prime}\right)-f(s)\right)\right] $$

Combining these terms yields the lower bound on the policy improvement.
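The importance-sampling step in the proof is what lets the surrogate term be estimated from data collected under the current policy π only. The following Monte Carlo sketch illustrates this on a made-up tabular MDP. The vector `d` is an assumed stand-in for the visitation distribution of π rather than a computed one, and all tables and the function `f` are hypothetical.

```python
import random

random.seed(0)
GAMMA = 0.9
P = [[[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]],   # P[s][a][s2]
     [[0.3, 0.3, 0.4], [0.5, 0.1, 0.4]],
     [[0.2, 0.5, 0.3], [0.6, 0.2, 0.2]]]
R = [[1.0, 0.0], [0.5, 2.0], [-1.0, 0.3]]  # R[s][a]
d = [0.4, 0.35, 0.25]                      # assumed stand-in for d_mu^pi
pi_old = [[0.5, 0.5], [0.2, 0.8], [0.9, 0.1]]
pi_new = [[0.6, 0.4], [0.3, 0.7], [0.7, 0.3]]
f = [0.4, -1.3, 2.1]


def draw(probs):
    """Sample an index from a discrete distribution."""
    u, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if u < acc:
            return i
    return len(probs) - 1


def delta(s, a, s2):
    """delta_f(s, a, s') = R(s,a) + gamma f(s') - f(s)."""
    return R[s][a] + GAMMA * f[s2] - f[s]


# Exact value of <d, delta_bar^{pi_new}>: actions follow the NEW policy.
exact = sum(d[s] * pi_new[s][a] * P[s][a][s2] * delta(s, a, s2)
            for s in range(3) for a in range(2) for s2 in range(3))

# Estimate from samples of the OLD policy, reweighted by pi_new / pi_old.
n, total = 200_000, 0.0
for _ in range(n):
    s = draw(d)
    a = draw(pi_old[s])
    s2 = draw(P[s][a])
    total += pi_new[s][a] / pi_old[s][a] * delta(s, a, s2)
estimate = total / n
print(exact, estimate)
```

The two printed numbers should agree up to sampling noise. The estimator requires π(a|s) > 0 wherever π′(a|s) > 0, and its variance grows when the probability ratio is large.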

B. Proof of Corollary 1

Proof: Let

$$ f(s)={V}^{\pi }(s),\kern0.5em {\delta}_f\left(s,a,{s}^{\prime}\right)={\delta}_V\left(s,a,{s}^{\prime}\right)=R\left(s,a\right)+\gamma {V}^{\pi}\left({s}^{\prime}\right)-{V}^{\pi }(s), $$

which is the TD error. Then

$$ \underset{s\sim {d}_{\mu}^{\pi },a\sim \pi, {s}^{\prime}\sim P}{E}\left[{\delta}_V\left(s,a,{s}^{\prime}\right)\right]=\underset{s\sim {d}_{\mu}^{\pi },a\sim \pi }{E}\left[{A}^{\pi}\left(s,a\right)\right]=0, $$
$$ \left\langle {d}_{\mu}^{\pi },{\overline{\delta}}_V^{\pi^{\prime }}\right\rangle =\underset{s\sim {d}_{\mu}^{\pi },a\sim {\pi}^{\prime },{s}^{\prime}\sim P}{E}\left[{\delta}_V\left(s,a,{s}^{\prime}\right)\right]=\underset{s\sim {d}_{\mu}^{\pi },a\sim \pi }{E}\left[\frac{\pi^{\prime}\left(a|s\right)}{\pi \left(a|s\right)}{A}^{\pi}\left(s,a\right)\right], $$
$$ \Delta {\overline{\delta}}_V^{\pi^{\prime }}={\max}_{i,j}\left|{\overline{\delta}}_V^{\pi^{\prime }}(i)-{\overline{\delta}}_V^{\pi^{\prime }}(j)\right|={\max}_{i,j}\left|\underset{a\sim {\pi}^{\prime }}{E}\left[{A}^{\pi}\left(i,a\right)\right]-\underset{a\sim {\pi}^{\prime }}{E}\left[{A}^{\pi}\left(j,a\right)\right]\right|={\max}_{i,j}\left|\underset{a\sim {\pi}^{\prime }}{E}\left[{A}^{\pi}\left(i,a\right)-{A}^{\pi}\left(j,a\right)\right]\right|. $$

By Theorem 1, Eq. (13) follows.
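The ingredients of this proof can be illustrated on a small tabular example. The sketch below uses made-up tables and is not the paper's algorithm. It computes V^π by iterating the Bellman expectation backup, forms the advantage as the expected TD error, and checks the intermediate bound obtained by combining the displays above with f = V^π: the improvement is at least (1/(1 − γ)) times the surrogate term minus ‖d^{π′} − d^π‖₁ Δ/2. The exact form of Eq. (13) is not reproduced here and may bound the visitation difference further.

```python
GAMMA = 0.9
S, A = 3, 2
P = [[[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]],   # P[s][a][s2], made-up numbers
     [[0.3, 0.3, 0.4], [0.5, 0.1, 0.4]],
     [[0.2, 0.5, 0.3], [0.6, 0.2, 0.2]]]
R = [[1.0, 0.0], [0.5, 2.0], [-1.0, 0.3]]  # R[s][a]
mu = [0.5, 0.3, 0.2]
pi_old = [[0.5, 0.5], [0.2, 0.8], [0.9, 0.1]]
pi_new = [[0.6, 0.4], [0.3, 0.7], [0.7, 0.3]]


def visitation(pi, steps=2000):
    """d(s) = (1 - gamma) * sum_t gamma^t Pr(s_t = s | mu, pi), truncated."""
    dist, d, w = mu[:], [0.0] * S, 1.0 - GAMMA
    for _ in range(steps):
        d = [d[s] + w * dist[s] for s in range(S)]
        dist = [sum(dist[s] * pi[s][a] * P[s][a][s2]
                    for s in range(S) for a in range(A)) for s2 in range(S)]
        w *= GAMMA
    return d


def value(pi, sweeps=2000):
    """V^pi by repeated Bellman expectation backups."""
    V = [0.0] * S
    for _ in range(sweeps):
        V = [sum(pi[s][a] * (R[s][a] + GAMMA * sum(
            P[s][a][s2] * V[s2] for s2 in range(S))) for a in range(A))
            for s in range(S)]
    return V


def ret(pi, d):
    """Discounted return J = E_{s~d, a~pi}[R(s,a)] / (1 - gamma)."""
    return sum(d[s] * pi[s][a] * R[s][a]
               for s in range(S) for a in range(A)) / (1 - GAMMA)


V = value(pi_old)
# Advantage A^pi(s,a): the TD error averaged over the next state.
adv = [[R[s][a] + GAMMA * sum(P[s][a][s2] * V[s2] for s2 in range(S)) - V[s]
        for a in range(A)] for s in range(S)]
d_old, d_new = visitation(pi_old), visitation(pi_new)

# E_{s~d_old, a~pi_old}[A^pi] vanishes.
zero = sum(d_old[s] * pi_old[s][a] * adv[s][a]
           for s in range(S) for a in range(A))
# delta_bar(s) = E_{a~pi_new}[A^pi(s,a)], its span, and the surrogate term.
dbar = [sum(pi_new[s][a] * adv[s][a] for a in range(A)) for s in range(S)]
span = max(dbar) - min(dbar)
surrogate = sum(d_old[s] * dbar[s] for s in range(S))
l1 = sum(abs(x - y) for x, y in zip(d_new, d_old))
lower = (surrogate - l1 * span / 2.0) / (1 - GAMMA)
gain = ret(pi_new, d_new) - ret(pi_old, d_old)
print(abs(zero) < 1e-9, gain >= lower - 1e-9)
```

In this sketch the surrogate is computed exactly from the tables; in practice it would be estimated from samples drawn under π with importance weights.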

About this article

Cite this article

Yuan, Q., Xiao, N. A monotonic policy optimization algorithm for high-dimensional continuous control problem in 3D MuJoCo. Multimed Tools Appl 78, 28665–28680 (2019). https://doi.org/10.1007/s11042-018-6098-y

  • DOI: https://doi.org/10.1007/s11042-018-6098-y
