Abstract
A novel policy gradient (PG) adaptive dynamic programming method is developed to solve nonlinear discrete-time zero-sum games with unknown dynamics. To facilitate the implementation, a policy iteration algorithm is established to approximate the iterative Q-function as well as the control and disturbance policies via three neural network (NN) approximators. Then, the iterative Q-function is exploited to update the control and disturbance policies via the PG method. To stabilize the training process and improve data efficiency, the experience replay technique is applied to train the weight vectors of the three NNs using mini-batch empirical data drawn from the replay memory. Furthermore, the convergence of the iterative Q-function is proved. Simulation results on two numerical examples demonstrate the effectiveness of the proposed method.
Data availability statement
Not applicable.
Code availability
Not applicable.
Funding
This work was supported in part by the Beijing Natural Science Foundation under Grant 4212038, in part by the National Natural Science Foundation of China under Grants 61973330 and 62073085, in part by the Beijing Normal University Tang Scholar, in part by the Open Research Project of the State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences, under Grant 20210108 and in part by the Open Research Project of the Key Laboratory of Industrial Internet of Things & Networked Control, Ministry of Education under Grant 2021FF10.
Author information
Contributions
Mingduo Lin contributed to methodology, writing, and editing. Bo Zhao contributed to supervision, editing, and review. Derong Liu contributed to supervision and review.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethical approval
All the authors have approved the manuscript for publication, and no conflict of interest exists. On behalf of my co-authors, I declare that this work is original research that has not been published previously and is not under consideration for other publications, in whole or in part. This manuscript does not contain any studies with human participants or animals performed by any of the authors. Informed consent was obtained from all individual participants included in the manuscript.
Consent to participate
All the authors consent to participate in this work.
Consent for publication
All the authors consent to publish this work.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Lin, M., Zhao, B. & Liu, D. Policy gradient adaptive dynamic programming for nonlinear discrete-time zero-sum games with unknown dynamics. Soft Comput 27, 5781–5795 (2023). https://doi.org/10.1007/s00500-023-07817-6