
A projected primal-dual gradient optimal control method for deep reinforcement learning. (English) Zbl 1472.49042

Summary: In this contribution, we start with a policy-based reinforcement learning ansatz using neural networks. The underlying Markov decision process consists of a transition probability representing the dynamical system and a policy realized by a neural network that maps the current state to the parameters of a distribution, from which the next control can be sampled. In this setting, the neural network is replaced by an ODE, based on a recently discussed interpretation of neural networks. The resulting infinite-dimensional optimization problem is transformed into a problem similar to well-known optimal control problems. Afterwards, necessary optimality conditions are established, and from these a new numerical algorithm is derived. The operating principle is demonstrated with two examples. In the first, a moving point is steered through an obstacle course to a desired end position in a 2D plane. The second example shows the applicability to more complex problems: there, the aim is to control the fingertip of a human arm model with five degrees of freedom and 29 Hill-type muscle models to a desired end position.
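To fix ideas, the following display sketches one standard form of such an ODE-constrained reformulation together with a projected gradient step based on it; the symbols used here (objective \(J\), dynamics \(f\), running cost \(\ell\), terminal cost \(\varphi\), Hamiltonian \(H\), admissible set \(U_{\mathrm{ad}}\), step size \(\alpha\)) are generic placeholders and not the notation of the paper.
\[
\min_{u\in U_{\mathrm{ad}}}\ J(y,u)=\int_0^T \ell\bigl(y(t),u(t)\bigr)\,\mathrm{d}t+\varphi\bigl(y(T)\bigr)
\quad\text{s.t.}\quad \dot y(t)=f\bigl(y(t),u(t)\bigr),\ \ y(0)=y_0.
\]
With the Hamiltonian \(H(y,u,\lambda)=\ell(y,u)+\lambda^{\top}f(y,u)\), the adjoint (dual) variable satisfies
\[
\dot\lambda(t)=-\nabla_y H\bigl(y(t),u(t),\lambda(t)\bigr),\qquad \lambda(T)=\nabla_y\varphi\bigl(y(T)\bigr),
\]
and a projected gradient step in the primal variable reads
\[
u^{k+1}=P_{U_{\mathrm{ad}}}\bigl(u^{k}-\alpha\,\nabla_u H(y^{k},u^{k},\lambda^{k})\bigr),
\]
where \(y^{k}\) and \(\lambda^{k}\) are obtained by forward integration of the state equation and backward integration of the adjoint equation, respectively.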

MSC:

49K15 Optimality conditions for problems involving ordinary differential equations
90C40 Markov and semi-Markov decision processes
93E35 Stochastic learning and adaptive control
60J20 Applications of Markov chains and discrete-time Markov processes on general state spaces (social mobility, learning theory, industrial processes, etc.)
68Q06 Networks and circuits as models of computation; circuit complexity

References:

[1] Pontryagin, LS; Boltyanskii, VG; Gamkrelidze, RV; Mishchenko, EF, The mathematical theory of optimal processes (1962), New York: Wiley, New York · Zbl 0102.32001
[2] Sussmann, HJ; Willems, JC, 300 years of optimal control: from the brachystochrone to the maximum principle, IEEE Control Syst. Mag., 17, 32-44 (1997) · Zbl 1014.49001 · doi:10.1109/37.588098
[3] Sutton, RS; Barto, AG, Reinforcement learning: an introduction (2018), Cambridge: MIT Press, Cambridge · Zbl 1407.68009
[4] Bertsekas, DP, Reinforcement learning and optimal control (2019), Athena Scientific
[5] Schulman, J.; Levine, S.; Abbeel, P.; Jordan, MI; Moritz, P., Trust region policy optimization, International conference on machine learning, JMLR, 1889-1897 (2015)
[6] Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O., Proximal policy optimization algorithms, CoRR (2017). arXiv:1707.06347
[7] Haber, E.; Ruthotto, L., Stable architectures for deep neural networks, Inverse problems (2017), Bristol: IOP Publishing, Bristol
[8] Benning, M.; Celledoni, E.; Ehrhardt, MJ; Owren, B.; Schönlieb, C., Deep learning as optimal control problems: models and numerical methods, CoRR (2019). arXiv:1904.05657 · Zbl 1429.68249
[9] Ben-Tal, A.; El Ghaoui, L.; Nemirovski, AS, Robust optimization (2009), Princeton: Princeton University Press, Princeton · Zbl 1221.90001
[10] Feinberg, EA; Shwartz, A., Handbook of Markov decision processes: methods and applications (2002), Berlin: Springer, Berlin · Zbl 0979.90001
[11] Neveu, J., Mathematical foundations of the calculus of probability (1965), Oakland: Holden-Day, Oakland · Zbl 0137.11301
[12] Yang, Y.; Caluwaerts, K.; Iscen, A.; Zhang, T.; Tan, J.; Sindhwani, V., Data efficient reinforcement learning for legged robots, CoRR (2019). arXiv:1907.03613
[13] Watkins, CJCH; Dayan, P., Q-learning, Mach. Learn., 8, 279-292 (1992) · Zbl 0773.68062
[14] Williams, RJ, Simple statistical gradient-following algorithms for connectionist reinforcement learning, Mach. Learn., 8, 229-256 (1992) · Zbl 0772.68076
[15] Buyya, R.; Calheiros, RN; Dastjerdi, AV, Big data: principles and paradigms (2016), Amsterdam: Elsevier, Amsterdam
[16] Grüne, L.; Junge, O., Gewöhnliche Differentialgleichungen (2016), Wiesbaden: Springer, Wiesbaden · Zbl 1385.34001
[17] Gerdts, M., Optimal control of ODEs and DAEs (2012), Berlin: De Gruyter, Berlin · Zbl 1275.49001
[18] Gerdts, M., Optimal control of ordinary differential equations and differential-algebraic equations, Habilitation thesis, University of Bayreuth, Dr. Hut Verlag (2007)
[19] Machielsen, KCP, Numerical solution of optimal control problems with state constraints by sequential quadratic programming in function space (1988), Amsterdam: Centrum voor Wiskunde en Informatica, Amsterdam · Zbl 0661.65067
[20] Burger, M., Optimal control of dynamical systems: calculating input data for multibody system simulation, Dissertation, Technical University of Kaiserslautern, Dr. Hut Verlag (2011)
[21] Roller, M.; Björkenstam, S.; Linn, J.; Leyendecker, S., Optimal control of a biomechanical multibody model for the dynamic simulation of working tasks, Proceedings of the 8th ECCOMAS thematic conference on multibody dynamics, 817-826 (2017)
[22] Obentheuer, M.; Roller, M.; Björkenstam, S.; Berns, K.; Linn, J., Comparison of different actuation modes of a biomechanical human arm model in an optimal control framework, Proceedings of the 5th joint international conference on multibody system dynamics (2018)
[23] Burger, M.; Gottschalk, S.; Roller, M., Reinforcement learning applied to a human arm model, Proceedings of the 9th ECCOMAS thematic conference on multibody dynamics, 68-75 (2019) · Zbl 1493.70034
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases, these data have been complemented or enhanced by data from zbMATH Open. The list attempts to reflect the references in the original paper as accurately as possible, without claiming completeness or perfect matching.