Abstract
One challenge in applying reinforcement learning with nonlinear function approximators to high-dimensional continuous control problems is that the policy updates produced by many existing algorithms may fail to improve policy performance, or may even cause a serious degradation of it. To address this challenge, this paper proposes a new lower bound on the policy improvement in which an average policy divergence over the state space is penalized. To the best of our knowledge, this is currently the best available lower bound on the policy improvement. Optimizing the lower bound directly is difficult, because it demands high computational overhead. Following the idea of trust region policy optimization (TRPO), this paper therefore also presents a monotonic policy optimization algorithm based on the new lower bound; it generates a sequence of monotonically improving policies and is suitable for large-scale continuous control problems. The proposed algorithm is evaluated and compared with several existing algorithms on highly challenging robot locomotion tasks.
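For orientation, the classical bound of this family (Kakade and Langford 2002; Schulman et al. 2015a) is reproduced below; it penalizes the maximum KL divergence over states, whereas the bound proposed here penalizes an average divergence with a different constant, so this display is background rather than the paper's Theorem 1:
\[
\eta(\pi') \;\ge\; L_{\pi}(\pi') - \frac{4\epsilon\gamma}{(1-\gamma)^{2}}\,\max_{s} D_{\mathrm{KL}}\bigl(\pi(\cdot\mid s)\,\|\,\pi'(\cdot\mid s)\bigr),
\qquad \epsilon = \max_{s,a}\bigl|A_{\pi}(s,a)\bigr|,
\]
where \( L_{\pi}(\pi') = \eta(\pi) + \sum_{s}\rho_{\pi}(s)\sum_{a}\pi'(a\mid s)A_{\pi}(s,a) \) is the surrogate objective built from the discounted state visitation frequencies \( \rho_{\pi} \).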
References
Achiam J (2016) Easy monotonic policy iteration. arXiv:1602.09118
Duan Y, Chen X, Houthooft R, et al (2016) Benchmarking deep reinforcement learning for continuous control. Proceedings of the 33rd International Conference on Machine Learning, p 1329–1338
Haviv M, Van der Heyden L (1984) Perturbation bounds for the stationary probabilities of a finite Markov chain. Adv Appl Probab 16(4):804–818. http://www.jstor.org/stable/142734
Kakade S (2001) A natural policy gradient. In: NIPS, vol 14, p 1531–1538
Kakade S, Langford J (2002) Approximately optimal approximate reinforcement learning. Proceedings of the Nineteenth International Conference on Machine Learning. Morgan Kaufmann, p 267–274
Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet classification with deep convolutional neural networks. In: Advances in neural information processing systems, p 1097–1105
Martens J, Grosse R (2015) Optimizing neural networks with Kronecker-factored approximate curvature. Proceedings of the 32nd International Conference on Machine Learning, p 2408–2417
Martens J, Sutskever I (2012) Training deep and recurrent networks with Hessian-free optimization. In: Neural networks: tricks of the trade. Springer, p 479–535
Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. Nature 518(7540):529–533
Mnih V, Badia AP, Mirza M, Graves A, Lillicrap T, Harley T, Silver D, Kavukcuoglu K (2016) Asynchronous methods for deep reinforcement learning. Proceedings of the 33rd International Conference on Machine Learning, p 1928–1937
Ng AY, Harada D, Russell SJ (1999) Policy invariance under reward transformations: theory and application to reward shaping. Sixteenth International Conference on Machine Learning. Morgan Kaufmann, p 278–287
Peters J, Schaal S (2008) Natural actor-critic. Neurocomputing 71(7):1180–1190
Pirotta M, Restelli M, Pecorino A, et al (2013) Safe policy iteration. International Conference on Machine Learning, p 307–315
Schulman J, Levine S, Moritz P, et al (2015a) Trust region policy optimization. Proceedings of the 32nd International Conference on Machine Learning, p 1889–1897
Schulman J, Moritz P, Levine S, et al (2015b) High-dimensional continuous control using generalized advantage estimation. arXiv:1506.02438
Silver D, Lever G, Heess N, et al (2014) Deterministic policy gradient algorithms. International Conference on Machine Learning, p 387–395
Sutton R, Barto A (1998) Reinforcement learning: an introduction. MIT Press
Thomas P, Theocharous G, Ghavamzadeh M (2015) High confidence policy improvement. Proceedings of the 32nd International Conference on Machine Learning, p 2380–2388
Thomas PS, Theocharous G, Ghavamzadeh M (2015) High-confidence off-policy evaluation. In: AAAI, p 3000–3006
Todorov E, Erez T, Tassa Y (2012) MuJoCo: a physics engine for model-based control. Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, p 5026–5033
Van Hasselt H, Guez A, Silver D (2015) Deep reinforcement learning with double Q-learning. CoRR abs/1509.06461
Wang Z, de Freitas N, Lanctot M (2016) Dueling network architectures for deep reinforcement learning. Proceedings of the 33rd International Conference on Machine Learning, p 1995–2003
Williams RJ (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach Learn 8(3):229–256
Ye Y (2011) The simplex and policy-iteration methods are strongly polynomial for the Markov decision problem with a fixed discount rate. Math Oper Res 36(4):593–603
Acknowledgments
This research was funded by the National Natural Science Foundation (Project No. 61573145), the Public Research and Capacity Building of Guangdong Province (Project No. 2014B010104001), and the Basic and Applied Basic Research of Guangdong Province (Project No. 2015A030308018). The authors are grateful for these grants.
Appendix
A. Proof of Theorem 1
Proof: for convenience, define
where s' is the next state reached after the agent takes action a in state s. By Lemma 1, the following identity is obtained
\( \mathbb{E}_{s\sim d_{\mu}^{\pi'},\,a\sim \pi',\,s'\sim P}\bigl[\delta_f(s,a,s')\bigr] \) is rewritten in inner-product form as follows
Different inequalities applied to the inner product \( \bigl\langle d^{\pi'}-d^{\pi},\,\overline{\delta}_f^{\pi'}\bigr\rangle \), such as Hölder's inequality [1], lead to different lower bounds; here Lemma 3 is used. To our knowledge, the lower bound derived here is tighter than the existing ones. By importance sampling, the identity can be rewritten as follows
Bringing all the terms together, the lower bound on the policy improvement is obtained.
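To illustrate how a bound of this kind can drive a policy update, the following is a minimal sketch, not the authors' exact algorithm: a diagonal-Gaussian PyTorch policy is improved by a few gradient steps on an importance-sampled surrogate advantage penalized by the average KL divergence over sampled states, and the old parameters are restored whenever the penalized objective does not improve. The coefficient `penalty_coef`, the network sizes, and the use of Adam are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Diagonal-Gaussian policy with a small MLP mean network (illustrative)."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.mean = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                                  nn.Linear(64, act_dim))
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def dist(self, obs):
        return torch.distributions.Normal(self.mean(obs), self.log_std.exp())

def penalized_objective(policy, old_dist, obs, act, adv, penalty_coef):
    """Importance-sampled surrogate advantage minus an average-KL penalty."""
    new_dist = policy.dist(obs)
    # importance weights pi'(a|s) / pi(a|s), summed over action dimensions
    ratio = torch.exp(new_dist.log_prob(act).sum(-1) - old_dist.log_prob(act).sum(-1))
    surrogate = (ratio * adv).mean()
    avg_kl = torch.distributions.kl_divergence(old_dist, new_dist).sum(-1).mean()
    return surrogate - penalty_coef * avg_kl

def monotonic_update(policy, obs, act, adv, penalty_coef=1.0, lr=1e-2, steps=20):
    """A few gradient steps on the penalized surrogate; revert if it got worse."""
    with torch.no_grad():
        snapshot = policy.dist(obs)
        old_dist = torch.distributions.Normal(snapshot.loc.clone(), snapshot.scale.clone())
        old_params = [p.clone() for p in policy.parameters()]
        baseline = penalized_objective(policy, old_dist, obs, act, adv, penalty_coef)
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        (-penalized_objective(policy, old_dist, obs, act, adv, penalty_coef)).backward()
        opt.step()
    with torch.no_grad():
        # keep the new parameters only if the penalized surrogate did not decrease
        if penalized_objective(policy, old_dist, obs, act, adv, penalty_coef) < baseline:
            for p, p_old in zip(policy.parameters(), old_params):
                p.copy_(p_old)
```

In practice the advantages would come from an estimator such as generalized advantage estimation (Schulman et al. 2015b), and the penalty coefficient would be chosen in line with the constant appearing in the lower bound.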
B. Proof of Corollary 1
Proof: let
(the TD error); obviously,
by Theorem 1, Eq. (13) is obtained.
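For reference, the TD error has its standard form, with \( V^{\pi} \) denoting the state-value function of the current policy; the exact expression used in the corollary is an assumption here:
\[
\delta_{V}(s,a,s') = r(s,a) + \gamma V^{\pi}(s') - V^{\pi}(s).
\]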