Two time-scale stochastic approximation with controlled Markov noise and off-policy temporal-difference learning. (English) Zbl 1434.62174

Summary: We present for the first time an asymptotic convergence analysis of two time-scale stochastic approximation driven by “controlled” Markov noise. In particular, the faster and slower recursions have nonadditive controlled Markov noise components in addition to martingale difference noise. We analyze the asymptotic behavior of our framework by relating it to limiting differential inclusions in both time scales that are defined in terms of the ergodic occupation measures associated with the controlled Markov processes. Finally, we present a solution to the off-policy convergence problem for temporal-difference learning with linear function approximation, using our results.
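The two time-scale scheme described above can be written in a standard form (a sketch following the usual stochastic-approximation notation; the precise assumptions are stated in the paper):

```latex
x_{n+1} = x_n + a(n)\,\big[ h(x_n, y_n, Z^{(1)}_n) + M^{(1)}_{n+1} \big], \\
y_{n+1} = y_n + b(n)\,\big[ g(x_n, y_n, Z^{(2)}_n) + M^{(2)}_{n+1} \big],
```

where the step sizes satisfy the Robbins-Monro conditions with \(b(n)/a(n) \to 0\), so that \(y_n\) is the slower recursion; \(Z^{(1)}_n, Z^{(2)}_n\) are the (nonadditive) controlled Markov noise processes and \(M^{(1)}_{n+1}, M^{(2)}_{n+1}\) are martingale difference sequences. The limiting dynamics in each time scale are differential inclusions built from the ergodic occupation measures of these controlled Markov processes.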
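As a concrete illustration of the kind of two time-scale off-policy algorithm covered by this analysis, the following is a minimal sketch of a TDC/GTD-style update with linear function approximation and importance-sampling ratios. The 2-state chain, features, policies, and step-size exponents here are all illustrative assumptions, not taken from the paper:

```python
import numpy as np

def tdc_off_policy(num_steps=5000, seed=0):
    """Sketch of a two time-scale off-policy TD update (TDC/GTD style)
    with linear function approximation. The MDP, features, and policies
    below are hypothetical, chosen only to make the example runnable."""
    rng = np.random.default_rng(seed)
    gamma = 0.9
    # One feature row per state of a 2-state chain.
    phi = np.array([[1.0, 0.0],
                    [0.5, 1.0]])
    b_pol = np.array([0.5, 0.5])    # behavior policy over 2 actions
    pi_pol = np.array([0.1, 0.9])   # target policy; action a moves to state a
    theta = np.zeros(2)             # slow iterate: value-function weights
    w = np.zeros(2)                 # fast iterate: correction weights
    s = 0
    for n in range(1, num_steps + 1):
        alpha = 1.0 / n             # slower step size for theta
        beta = 1.0 / n ** 0.6       # faster step size: beta/alpha -> infinity
        a = rng.choice(2, p=b_pol)
        rho = pi_pol[a] / b_pol[a]  # importance-sampling ratio
        s_next = a                  # deterministic transition for simplicity
        r = 1.0 if s_next == 1 else 0.0
        f, f_next = phi[s], phi[s_next]
        delta = r + gamma * (theta @ f_next) - theta @ f   # TD error
        theta = theta + alpha * rho * (delta * f - gamma * (w @ f) * f_next)
        w = w + beta * (rho * delta - w @ f) * f
        s = s_next
    return theta, w
```

Note that the secondary weights `w` run on the faster time scale (`beta` decays more slowly than `alpha`), matching the faster/slower recursion structure analyzed in the paper; both updates are driven by the state process, which here plays the role of the Markov noise.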

MSC:

62L20 Stochastic approximation
60J05 Discrete-time Markov processes on general state spaces
93E35 Stochastic learning and adaptive control
