
Fully cooperative games with state and input constraints using reinforcement learning based on control barrier functions. (English) Zbl 07892509

Summary: This paper presents a novel safe reinforcement learning (RL) control algorithm that solves safe optimal control problems for fully cooperative (FC) games of discrete-time multiplayer nonlinear systems with state and input constraints. The FC game is a special case of the nonzero-sum (NZS) game in which all players cooperate to accomplish a common task. The algorithm is built on the policy iteration (PI) framework and uses only measured data along the system trajectories in the environment. Unlike most work on PI, an effective method for obtaining initial safe and stabilizing control policies is given here. In addition, control barrier functions (CBFs) and an input constraint function are introduced to augment the reward functions, and the monotonically nonincreasing property of the iterative value function in the PI algorithm keeps the safe set forward invariant. Neural networks are then employed to approximate the system dynamics, the iterative control policies, and the iterative value function. Furthermore, the proposed algorithm is supported by theoretical proofs that guarantee both safety and convergence. Finally, the effectiveness and safety of the algorithm are illustrated by simulation results.
© 2023 Chinese Automatic Control Society and John Wiley & Sons Australia, Ltd
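
As a rough orientation only (not reproduced from the paper, whose exact formulation is not given in this summary), a standard discrete-time control barrier function condition and a generic CBF-augmented stage reward can be sketched as follows, assuming dynamics \( x_{k+1} = f(x_k, u_k) \), a safe set encoded by a function \( h \), and illustrative symbols \( \gamma \), \( \lambda \), \( B \) that are not the paper's notation:

% safe set encoded by a barrier-type function h
\mathcal{C} = \{ x \in \mathbb{R}^n : h(x) \ge 0 \},
% discrete-time CBF condition rendering \mathcal{C} forward invariant (cf. [39])
h\bigl(f(x_k, u_k)\bigr) - h(x_k) \ge -\gamma\, h(x_k), \qquad 0 < \gamma \le 1,
% stage reward augmented with a barrier penalty B (e.g., a log or reciprocal barrier)
\bar{r}(x_k, u_k) = r(x_k, u_k) + \lambda\, B\bigl(h(x_k)\bigr).

In this spirit, each policy evaluation and improvement step of the PI algorithm operates on the augmented reward \( \bar{r} \), so that the monotonically nonincreasing iterative value function can be tied to forward invariance of the safe set.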

MSC:

93-XX Systems theory; control
Full Text: DOI

References:

[1] R. S.Sutton and A. G.Barto, Reinforcement learning: an introduction, MIT Press, 2018. · Zbl 1407.68009
[2] G.Kahn, A.Villaflor, B.Ding, P.Abbeel, and S.Levine, Self‐supervised deep reinforcement learning with generalized computation graphs for robot navigation, 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 5129-5136.
[3] Q.Zheng, C.Jin, Z.Hu, and H.Zhang, A study of aero‐engine control method based on deep reinforcement learning, IEEE Access7 (2019), 55285-55289.
[4] P. N.Dao and Y.‐C.Liu, Adaptive reinforcement learning in control design for cooperating manipulator systems, Asian J. Control24 (2022), no. 3, 1088-1103. · Zbl 07887037
[5] J.Mi, N.Kuze, and T.Ushio, A mobile robot controller using reinforcement learning under SCLTL specifications with uncertainties, Asian J. Control24 (2022), no. 6, 2916-2930. · Zbl 07887187
[6] J.Li, T.Chai, F. L.Lewis, Z.Ding, and Y.Jiang, Off‐policy interleaved \( Q \)‐learning: optimal control for affine nonlinear discrete‐time systems, IEEE Trans. Neural Netw. Learn. Syst.30 (2018), no. 5, 1308-1320.
[7] C.Mu, D.Wang, and H.He, Novel iterative neural dynamic programming for data‐based approximate optimal control design, Automatica81 (2017), 240-252. · Zbl 1373.90170
[8] A. M.Zaki, A. M.El‐Nagar, M.El‐Bardini, and F. A. S.Soliman, Deep learning controller for nonlinear system based on Lyapunov stability criterion, Neural Comput. Appl.33 (2021), 1515-1531.
[9] Q.Wei, L.Zhu, R.Song, P.Zhang, D.Liu, and J.Xiao, Model‐free adaptive optimal control for unknown nonlinear multiplayer nonzero‐sum game, IEEE Trans. Neural Netw. Learn. Syst.33 (2020), no. 2, 879-892.
[10] J.Li, J.Ding, T.Chai, and F. L.Lewis, Nonzero‐sum game reinforcement learning for performance optimization in large‐scale industrial processes, IEEE Trans. Cybern.50 (2019), no. 9, 4132-4145.
[11] Z.Khan, S.Glisic, L. A.DaSilva, and J.Lehtomäki, Modeling the dynamics of coalition formation games for cooperative spectrum sharing in an interference channel, IEEE Trans. Comput. Intell. AI Games3 (2010), no. 1, 17-30.
[12] F. M.Zedan, A. M.Al‐Shehri, S. Z.Zakhary, M. H.Al‐Anazi, A. S.Al‐Mozan, and Z. R.Al‐Zaid, A nonzero sum approach to interactive electricity consumption, IEEE Trans. Power Deliv.25 (2009), no. 1, 66-71.
[13] J.Kim, Cooperative localization and control of multiple heterogeneous robots using a string formation, Asian J. Control25 (2023), no. 2, 794-806. · Zbl 07889038
[14] Q.Ouyang, Z.Wu, Y.Cong, and Z.Wang, Formation control of unmanned aerial vehicle swarms: a comprehensive review, Asian J. Control25 (2023), no. 1, 570-593. · Zbl 07889018
[15] H.Jiang, H.Zhang, K.Zhang, and X.Cui, Data‐driven adaptive dynamic programming schemes for non‐zero‐sum games of unknown discrete‐time nonlinear systems, Neurocomputing275 (2018), 649-658.
[16] Y.Yang, K. G.Vamvoudakis, and H.Modares, Safe reinforcement learning for dynamical games, Int. J. Robust Nonlinear Control30 (2020), no. 9, 3706-3726. · Zbl 1466.91038
[17] P.Liu, H.Zhang, H.Ren, and C.Liu, Online event‐triggered adaptive critic design for multi‐player zero‐sum games of partially unknown nonlinear systems with input constraints, Neurocomputing462 (2021), 309-319.
[18] B.Luo, Y.Yang, and D.Liu, Policy iteration Q‐learning for data‐based two‐player zero‐sum game of linear discrete‐time systems, IEEE Trans. Cybern.51 (2020), no. 7, 3630-3640.
[19] Q.Wei, D.Liu, Q.Lin, and R.Song, Adaptive dynamic programming for discrete‐time zero‐sum games, IEEE Trans. Neural Netw. Learn. Syst.29 (2017), no. 4, 957-969.
[20] H.‐Y.Jiang, B.Zhou, and G.‐P.Liu, \( H_{\infty} \) optimal control of unknown linear systems by adaptive dynamic programming with applications to time‐delay systems, Int. J. Robust Nonlinear Control31 (2021), no. 12, 5602-5617. · Zbl 1525.93065
[21] J.Wang and X.Chen, \( H_{\infty} \) consensus for Markov jump multi‐agent systems with partly unknown transition probabilities and multiplicative noise, Asian J. Control25 (2023), no. 2, 1653-1662. · Zbl 07889099
[22] Y.Wang, Y.Han, and C.Gao, Robust \( H_{\infty} \) sliding mode control for uncertain discrete singular T‐S fuzzy Markov jump systems, Asian J. Control25 (2023), no. 1, 524-536. · Zbl 07889014
[23] X.Liu, R.Liu, and Y.Li, Infinite time linear quadratic Stackelberg game problem for unknown stochastic discrete‐time systems via adaptive dynamic programming approach, Asian J. Control23 (2021), no. 2, 937-948. · Zbl 07878861
[24] A. D.Robles‐Aguilar, D.González‐Sánchez, and J. A.Minjárez‐Sosa, Empirical approximation of Nash equilibria in finite Markov games with discounted payoffs, Asian J. Control25 (2023), no. 2, 722-734. · Zbl 07889032
[25] H.Jiang, H.Zhang, X.Xie, and J.Han, Neural‐network‐based learning algorithms for cooperative games of discrete‐time multi‐player systems with control constraints via adaptive dynamic programming, Neurocomputing344 (2019), 13-19.
[26] Q.Zhang, D.Zhao, and Y.Zhu, Data‐driven adaptive dynamic programming for continuous‐time fully cooperative games with partially constrained inputs, Neurocomputing238 (2017), 377-386.
[27] Q.Zhang, D.Zhao, and F. L.Lewis, Model‐free reinforcement learning for fully cooperative multi‐agent graphical games, 2018 International Joint Conference on Neural Networks (IJCNN). IEEE, 2018, pp. 1-6.
[28] H.Wang and M.Li, Model‐free reinforcement learning for fully cooperative consensus problem of nonlinear multiagent systems, IEEE Trans. Neural Netw. Learn. Syst.33 (2022), no. 4, 1482-1491.
[29] D.Liu and Q.Wei, Policy iteration adaptive dynamic programming algorithm for discrete‐time nonlinear systems, IEEE Trans. Neural Netw. Learn. Syst.25 (2013), no. 3, 621-634.
[30] D. Q.Mayne, J. B.Rawlings, C. V.Rao, and P. O. M.Scokaert, Constrained model predictive control: stability and optimality, Automatica36 (2000), no. 6, 789-814. · Zbl 0949.93003
[31] G.Wu and K.Sreenath, Safety‐critical control of a planar quadrotor, 2016 American Control Conference (ACC). IEEE, 2016, pp. 2252-2258.
[32] T.Gurriet, A.Singletary, J.Reher, L.Ciarletta, E.Feron, and A.Ames, Towards a framework for realizable safety critical control through active set invariance, 2018 ACM/IEEE 9th International Conference on Cyber‐Physical Systems (ICCPS). IEEE, 2018, pp. 98-106.
[33] Z.Marvi and B.Kiumarsi, Safe reinforcement learning: a control barrier function optimization approach, Int. J. Robust Nonlinear Control31 (2021), no. 6, 1923-1940. · Zbl 1526.93181
[34] J.Xu, J.Wang, J.Rao, Y.Zhong, and H.Wang, Adaptive dynamic programming for optimal control of discrete‐time nonlinear system with state constraints based on control barrier function, Int. J. Robust Nonlinear Control32 (2022), no. 6, 3408-3424. · Zbl 1527.93233
[35] R.Cheng, G.Orosz, R. M.Murray, and J. W.Burdick, End‐to‐end safe reinforcement learning through barrier functions for safety‐critical continuous control tasks, Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, 2019, pp. 3387-3395.
[36] Y.Yang, K. G.Vamvoudakis, H.Modares, Y.Yin, and D. C.Wunsch, Safe intermittent reinforcement learning with static and dynamic event generators, IEEE Trans. Neural Netw. Learn. Syst.31 (2020), no. 12, 5441-5455.
[37] Y.Xiong, D.‐H.Zhai, M.Tavakoli, and Y.Xia, Discrete‐time control barrier function: high‐order case and adaptive case, IEEE Trans. Cybern.53 (2023), no. 5, 3231-3239.
[38] A. D.Ames, X.Xu, J. W.Grizzle, and P.Tabuada, Control barrier function based quadratic programs for safety critical systems, IEEE Trans. Autom. Control62 (2016), no. 8, 3861-3876. · Zbl 1373.90092
[39] A.Agrawal and K.Sreenath, Discrete control barrier functions for safety‐critical control of discrete systems with application to bipedal robot navigation, Robot Sci. Syst., Vol. 13, Cambridge, MA, USA, 2017.
[40] X.Guo, W.Yan, and R.Cui, Reinforcement learning‐based nearly optimal control for constrained‐input partially unknown systems using differentiator, IEEE Trans. Neural Netw. Learn. Syst.31 (2019), no. 11, 4713-4725.
[41] B.Kiumarsi and F. L.Lewis, Actor-critic‐based optimal tracking for partially unknown nonlinear discrete‐time systems, IEEE Trans. Neural Netw. Learn. Syst.26 (2014), no. 1, 140-151.
[42] C.Li, J.Ding, F. L.Lewis, and T.Chai, A novel adaptive dynamic programming based on tracking error for nonlinear discrete‐time systems, Automatica129 (2021), 109687. · Zbl 1478.93321
[43] M.Liang and Q.Wei, A partial policy iteration ADP algorithm for nonlinear neuro‐optimal control with discounted total reward, Neurocomputing424 (2021), 23-34.
[44] Z.Shi and Z.Wang, Adaptive output‐feedback optimal control for continuous‐time linear systems based on adaptive dynamic programming approach, Neurocomputing438 (2021), 334-344.
[45] J. A. E.Andersson, J.Gillis, G.Horn, J. B.Rawlings, and M.Diehl, CasADi: a software framework for nonlinear optimization and optimal control, Math. Program. Comput.11 (2019), 1-36. · Zbl 1411.90004
[46] Q.Zhao, H.Xu, and S.Jagannathan, Neural network‐based finite‐horizon optimal control of uncertain affine nonlinear discrete‐time systems, IEEE Trans. Neural Netw. Learn. Syst.26 (2014), no. 3, 486-499.
[47] S.Ha, P.Xu, Z.Tan, S.Levine, and J.Tan, Learning to walk in the real world with minimal human effort, Conference on Robot Learning. PMLR, 2021, pp. 1110-1120.
[48] T. P.Lillicrap, J. J.Hunt, A.Pritzel, N.Heess, T.Erez, Y.Tassa, D.Silver, and D.Wierstra, Continuous control with deep reinforcement learning, 2015. arXiv preprint arXiv:1509.02971.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.