
Multi-agent reinforcement learning: a selective overview of theories and algorithms. (English) Zbl 07608712

Vamvoudakis, Kyriakos G. (ed.) et al., Handbook of reinforcement learning and control. Cham: Springer. Stud. Syst. Decis. Control 325, 321-384 (2021).
Summary: Recent years have witnessed significant advances in reinforcement learning (RL), which has achieved tremendous success in solving various sequential decision-making problems in machine learning. Most of the successful RL applications, e.g., the games of Go and Poker, robotics, and autonomous driving, involve more than a single agent and thus naturally fall into the realm of multi-agent RL (MARL), a domain with a relatively long history that has recently re-emerged owing to advances in single-agent RL techniques. Though empirically successful, MARL still lacks comparably developed theoretical foundations in the literature. In this chapter, we provide a selective overview of MARL, with a focus on algorithms backed by theoretical analysis. More specifically, we review the theoretical results for MARL algorithms mainly within two representative frameworks, Markov/stochastic games and extensive-form games, organized by the types of tasks they address, i.e., fully cooperative, fully competitive, and a mix of the two. We also introduce several significant but challenging applications of these algorithms. Orthogonal to existing reviews of MARL, we highlight several new angles and taxonomies of MARL theory, including learning in extensive-form games, decentralized MARL with networked agents, MARL in the mean-field regime, and the (non-)convergence of policy-based methods for learning in games. Some of these new angles extrapolate from our own research endeavors and interests. Our overall goal with this chapter is, beyond providing an assessment of the current state of the field, to identify fruitful future research directions for theoretical studies of MARL. We expect this chapter to serve as a continuing stimulus for researchers interested in working on this exciting yet challenging topic.
For the entire collection see [Zbl 1492.49001].
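
For concreteness, the chapter's central formalism can be summarized as follows; this is a standard definition of the Markov/stochastic game framework, with notation chosen here for illustration rather than taken from the chapter itself. An \(N\)-agent Markov game is a tuple \((\mathcal{S},\{\mathcal{A}^i\}_{i=1}^N,P,\{R^i\}_{i=1}^N,\gamma)\), where \(\mathcal{S}\) is the state space, \(\mathcal{A}^i\) the action space of agent \(i\), \(P(s'\mid s,a^1,\dots,a^N)\) the transition kernel driven by the joint action, \(R^i(s,a^1,\dots,a^N)\) the reward of agent \(i\), and \(\gamma\in[0,1)\) the discount factor. Each agent \(i\) seeks a policy \(\pi^i(a^i\mid s)\) maximizing \(\mathbb{E}\big[\sum_{t\ge 0}\gamma^t R^i(s_t,a^1_t,\dots,a^N_t)\big]\); the fully cooperative setting corresponds to identical rewards \(R^1=\dots=R^N\), and the fully competitive (two-agent, zero-sum) setting to \(R^1=-R^2\).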

MSC:

68Txx Artificial intelligence

References:

[1] Silver, D.; Huang, A.; Maddison, CJ; Guez, A.; Sifre, L.; Van Den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M., Mastering the game of Go with deep neural networks and tree search, Nature, 529, 7587, 484-489 (2016) · doi:10.1038/nature16961
[2] Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A.; Guez, A.; Hubert, T.; Baker, L.; Lai, M.; Bolton, A., Mastering the game of Go without human knowledge, Nature, 550, 7676, 354 (2017) · doi:10.1038/nature24270
[3] OpenAI: OpenAI Five. https://blog.openai.com/openai-five/ (2018)
[4] Vinyals, O., Babuschkin, I., Chung, J., Mathieu, M., Jaderberg, M., Czarnecki, W.M., Dudzik, A., Huang, A., Georgiev, P., Powell, R., Ewalds, T., Horgan, D., Kroiss, M., Danihelka, I., Agapiou, J., Oh, J., Dalibard, V., Choi, D., Sifre, L., Sulsky, Y., Vezhnevets, S., Molloy, J., Cai, T., Budden, D., Paine, T., Gulcehre, C., Wang, Z., Pfaff, T., Pohlen, T., Wu, Y., Yogatama, D., Cohen, J., McKinney, K., Smith, O., Schaul, T., Lillicrap, T., Apps, C., Kavukcuoglu, K., Hassabis, D., Silver, D.: AlphaStar: mastering the real-time strategy game StarCraft II. https://deepmind.com/blog/alphastar-mastering-real-time-strategy-game-starcraft-ii/ (2019)
[5] Kober, J.; Bagnell, JA; Peters, J., Reinforcement learning in robotics: a survey, Int. J. Robot. Res., 32, 11, 1238-1274 (2013) · doi:10.1177/0278364913495721
[6] Lillicrap, T.P., Hunt, J.J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., Wierstra, D.: Continuous control with deep reinforcement learning. In: International Conference on Learning Representations (2016)
[7] Brown, N., Sandholm, T.: Libratus: the superhuman AI for no-limit Poker. In: International Joint Conference on Artificial Intelligence, pp. 5226-5228 (2017)
[8] Brown, N.; Sandholm, T., Superhuman AI for multiplayer poker, Science, 365, 885-890 (2019) · Zbl 1433.68316 · doi:10.1126/science.aay2400
[9] Shalev-Shwartz, S., Shammah, S., Shashua, A.: Safe, multi-agent, reinforcement learning for autonomous driving (2016). arXiv preprint arXiv:1610.03295
[10] Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, AA; Veness, J.; Bellemare, MG; Graves, A.; Riedmiller, M.; Fidjeland, AK; Ostrovski, G., Human-level control through deep reinforcement learning, Nature, 518, 7540, 529-533 (2015) · doi:10.1038/nature14236
[11] Busoniu, L.; Babuska, R.; De Schutter, B., A comprehensive survey of multiagent reinforcement learning, IEEE Trans. Syst. Man Cybern. Part C, 38, 2, 156-172 (2008) · doi:10.1109/TSMCC.2007.913919
[12] Adler, J.L., Blue, V.J.: A cooperative multi-agent transportation management and route guidance system. Transp. Res. Part C: Emerg. Technol. 10(5), 433-454 (2002)
[13] Wang, S., Wan, J., Zhang, D., Li, D., Zhang, C.: Towards smart factory for industry 4.0: a self-organized multi-agent system with big data based feedback and coordination. Comput. Netw. 101, 158-168 (2016)
[14] Jangmin, O., Lee, J.W., Zhang, B.T.: Stock trading system using reinforcement learning with cooperative agents. In: International Conference on Machine Learning, pp. 451-458 (2002)
[15] Lee, J.W., Park, J., Jangmin, O., Lee, J., Hong, E.: A multiagent approach to \(Q \)-learning for daily stock trading. IEEE Trans. Syst. Man Cybern.-Part A: Syst. Hum. 37(6), 864-877 (2007)
[16] Cortes, J.; Martinez, S.; Karatas, T.; Bullo, F., Coverage control for mobile sensing networks, IEEE Trans. Robot. Autom., 20, 2, 243-255 (2004) · doi:10.1109/TRA.2004.824698
[17] Choi, J.; Oh, S.; Horowitz, R., Distributed learning and cooperative control for multi-agent systems, Automatica, 45, 12, 2802-2814 (2009) · Zbl 1192.93011 · doi:10.1016/j.automatica.2009.09.025
[18] Castelfranchi, C., The theory of social functions: challenges for computational social science and multi-agent learning, Cogn. Syst. Res., 2, 1, 5-38 (2001) · doi:10.1016/S1389-0417(01)00013-4
[19] Leibo, J.Z., Zambaldi, V., Lanctot, M., Marecki, J., Graepel, T.: Multi-agent reinforcement learning in sequential social dilemmas. In: International Conference on Autonomous Agents and Multi-Agent Systems, pp. 464-473 (2017)
[20] Hernandez-Leal, P., Kartal, B., Taylor, M.E.: A survey and critique of multiagent deep reinforcement learning (2018). arXiv preprint arXiv:1810.05587
[21] Foerster, J., Assael, Y.M., de Freitas, N., Whiteson, S.: Learning to communicate with deep multi-agent reinforcement learning. In: Advances in Neural Information Processing Systems, pp. 2137-2145 (2016)
[22] Zazo, S.; Macua, SV; Sánchez-Fernández, M.; Zazo, J., Dynamic potential games with constraints: fundamentals and applications in communications, IEEE Trans. Signal Process., 64, 14, 3806-3821 (2016) · Zbl 1414.94724 · doi:10.1109/TSP.2016.2551693
[23] Zhang, K., Yang, Z., Liu, H., Zhang, T., Başar, T.: Fully decentralized multi-agent reinforcement learning with networked agents. In: International Conference on Machine Learning, pp. 5867-5876 (2018)
[24] Subramanian, J., Mahajan, A.: Reinforcement learning in stationary mean-field games. In: International Conference on Autonomous Agents and Multi-Agent Systems, pp. 251-259 (2019)
[25] Heinrich, J., Silver, D.: Deep reinforcement learning from self-play in imperfect-information games (2016). arXiv preprint arXiv:1603.01121
[26] Lowe, R., Wu, Y., Tamar, A., Harb, J., Abbeel, P., Mordatch, I.: Multi-agent actor-critic for mixed cooperative-competitive environments. In: Advances in Neural Information Processing Systems, pp. 6379-6390 (2017)
[27] Foerster, J., Farquhar, G., Afouras, T., Nardelli, N., Whiteson, S.: Counterfactual multi-agent policy gradients (2017). arXiv preprint arXiv:1705.08926
[28] Gupta, J.K., Egorov, M., Kochenderfer, M.: Cooperative multi-agent control using deep reinforcement learning. In: International Conference on Autonomous Agents and Multi-Agent Systems, pp. 66-83 (2017)
[29] Omidshafiei, S., Pazis, J., Amato, C., How, J.P., Vian, J.: Deep decentralized multi-task multi-agent reinforcement learning under partial observability. In: International Conference on Machine Learning, pp. 2681-2690 (2017)
[30] Kawamura, K., Mizukami, N., Tsuruoka, Y.: Neural fictitious self-play in imperfect information games with many players. In: Workshop on Computer Games, pp. 61-74 (2017)
[31] Zhang, L., Wang, W., Li, S., Pan, G.: Monte Carlo neural fictitious self-play: Approach to approximate Nash equilibrium of imperfect-information games (2019). arXiv preprint arXiv:1903.09569
[32] Mazumdar, E., Ratliff, L.J.: On the convergence of gradient-based learning in continuous games (2018). arXiv preprint arXiv:1804.05464
[33] Jin, C., Netrapalli, P., Jordan, M.I.: Minmax optimization: stable limit points of gradient descent ascent are locally optimal (2019). arXiv preprint arXiv:1902.00618
[34] Zhang, K., Yang, Z., Başar, T.: Policy optimization provably converges to Nash equilibria in zero-sum linear quadratic games. In: Advances in Neural Information Processing Systems (2019)
[35] Sidford, A., Wang, M., Yang, L.F., Ye, Y.: Solving discounted stochastic two-player games with near-optimal time and sample complexity (2019). arXiv preprint arXiv:1908.11071
[36] Oliehoek, F.A., Amato, C.: A Concise Introduction to Decentralized POMDPs, vol. 1. Springer, Berlin (2016) · Zbl 1355.68005
[37] Arslan, G.; Yüksel, S., Decentralized Q-learning for stochastic teams and games, IEEE Trans. Autom. Control, 62, 4, 1545-1558 (2017) · Zbl 1366.91030 · doi:10.1109/TAC.2016.2598476
[38] Yongacoglu, B., Arslan, G., Yüksel, S.: Learning team-optimality for decentralized stochastic control and dynamic games (2019). arXiv preprint arXiv:1903.05812
[39] Zhang, K., Miehling, E., Başar, T.: Online planning for decentralized stochastic control with partial history sharing. In: IEEE American Control Conference, pp. 167-172 (2019)
[40] Hernandez-Leal, P., Kaisers, M., Baarslag, T., de Cote, E.M.: A survey of learning in multiagent environments: dealing with non-stationarity (2017). arXiv preprint arXiv:1707.09183
[41] Nguyen, T.T., Nguyen, N.D., Nahavandi, S.: Deep reinforcement learning for multi-agent systems: a review of challenges, solutions and applications (2018). arXiv preprint arXiv:1812.11794
[42] Oroojlooy Jadid, A., Hajinezhad, D.: A review of cooperative multi-agent deep reinforcement learning (2019). arXiv preprint arXiv:1908.03963
[43] Zhang, K., Yang, Z., Başar, T.: Networked multi-agent reinforcement learning in continuous spaces. In: IEEE Conference on Decision and Control, pp. 2771-2776 (2018)
[44] Zhang, K., Yang, Z., Liu, H., Zhang, T., Başar, T.: Finite-sample analyses for fully decentralized multi-agent reinforcement learning (2018). arXiv preprint arXiv:1812.02783
[45] Monahan, GE, State of the art-a survey of partially observable Markov decision processes: theory, models, and algorithms, Manag. Sci., 28, 1, 1-16 (1982) · Zbl 0486.90084 · doi:10.1287/mnsc.28.1.1
[46] Cassandra, A.R.: Exact and approximate algorithms for partially observable Markov decision processes. Brown University (1998)
[47] Bertsekas, DP, Dynamic Programming and Optimal Control (2005), Belmont: Athena Scientific, Belmont · Zbl 1125.90056
[48] Watkins, CJ; Dayan, P., Q-learning, Mach. Learn., 8, 3-4, 279-292 (1992) · Zbl 0773.68062
[49] Szepesvári, C.; Littman, ML, A unified analysis of value-function-based reinforcement-learning algorithms, Neural Comput., 11, 8, 2017-2060 (1999) · doi:10.1162/089976699300016070
[50] Singh, S.; Jaakkola, T.; Littman, ML; Szepesvári, C., Convergence results for single-step on-policy reinforcement-learning algorithms, Mach. Learn., 38, 3, 287-308 (2000) · Zbl 0954.68127 · doi:10.1023/A:1007678930559
[51] Chang, HS; Fu, MC; Hu, J.; Marcus, SI, An adaptive sampling algorithm for solving Markov decision processes, Oper. Res., 53, 1, 126-139 (2005) · Zbl 1165.90672 · doi:10.1287/opre.1040.0145
[52] Kocsis, L., Szepesvári, C.: Bandit based Monte-Carlo planning. In: European Conference on Machine Learning, pp. 282-293. Springer (2006)
[53] Coulom, R.: Efficient selectivity and backup operators in Monte-Carlo tree search. In: International Conference on Computers and Games, pp. 72-83 (2006)
[54] Agrawal, R., Sample mean based index policies with \(O(\log n)\) regret for the multi-armed bandit problem, Adv. Appl. Probab., 27, 4, 1054-1078 (1995) · Zbl 0840.90129 · doi:10.2307/1427934
[55] Auer, P.; Cesa-Bianchi, N.; Fischer, P., Finite-time analysis of the multiarmed bandit problem, Mach. Learn., 47, 2-3, 235-256 (2002) · Zbl 1012.68093 · doi:10.1023/A:1013689704352
[56] Jiang, D., Ekwedike, E., Liu, H.: Feedback-based tree search for reinforcement learning. In: International Conference on Machine Learning, pp. 2284-2293 (2018)
[57] Shah, D., Xie, Q., Xu, Z.: On reinforcement learning using Monte-Carlo tree search with supervised learning: non-asymptotic analysis (2019). arXiv preprint arXiv:1902.05213
[58] Tesauro, G., Temporal difference learning and TD-Gammon, Commun. ACM, 38, 3, 58-68 (1995) · doi:10.1145/203330.203343
[59] Tsitsiklis, J.N., Van Roy, B.: Analysis of temporal-difference learning with function approximation. In: Advances in Neural Information Processing Systems, pp. 1075-1081 (1997)
[60] Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press, Cambridge (2018) · Zbl 1407.68009
[61] Sutton, R.S., Szepesvári, C., Maei, H.R.: A convergent \(O(n)\) algorithm for off-policy temporal-difference learning with linear function approximation. In: Advances in Neural Information Processing Systems, vol. 21, pp. 1609-1616 (2008)
[62] Sutton, R.S., Maei, H.R., Precup, D., Bhatnagar, S., Silver, D., Szepesvári, C., Wiewiora, E.: Fast gradient-descent methods for temporal-difference learning with linear function approximation. In: International Conference on Machine Learning, pp. 993-1000 (2009)
[63] Liu, B., Liu, J., Ghavamzadeh, M., Mahadevan, S., Petrik, M.: Finite-sample analysis of proximal gradient TD algorithms. In: Conference on Uncertainty in Artificial Intelligence, pp. 504-513 (2015)
[64] Bhatnagar, S., Precup, D., Silver, D., Sutton, R.S., Maei, H.R., Szepesvári, C.: Convergent temporal-difference learning with arbitrary smooth function approximation. In: Advances in Neural Information Processing Systems, pp. 1204-1212 (2009)
[65] Dann, C.; Neumann, G.; Peters, J., Policy evaluation with temporal differences: a survey and comparison, J. Mach. Learn. Res., 15, 809-883 (2014) · Zbl 1317.68150
[66] Sutton, R.S., McAllester, D.A., Singh, S.P., Mansour, Y.: Policy gradient methods for reinforcement learning with function approximation. In: Advances in Neural Information Processing Systems, pp. 1057-1063 (2000)
[67] Williams, RJ, Simple statistical gradient-following algorithms for connectionist reinforcement learning, Mach. Learn., 8, 3-4, 229-256 (1992) · Zbl 0772.68076
[68] Baxter, J.; Bartlett, PL, Infinite-horizon policy-gradient estimation, J. Artif. Intell. Res., 15, 319-350 (2001) · Zbl 0994.68119 · doi:10.1613/jair.806
[69] Konda, V.R., Tsitsiklis, J.N.: Actor-critic algorithms. In: Advances in Neural Information Processing Systems, pp. 1008-1014 (2000) · Zbl 1049.93095
[70] Bhatnagar, S.; Sutton, R.; Ghavamzadeh, M.; Lee, M., Natural actor-critic algorithms, Automatica, 45, 11, 2471-2482 (2009) · Zbl 1183.93130 · doi:10.1016/j.automatica.2009.07.008
[71] Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., Riedmiller, M.: Deterministic policy gradient algorithms. In: International Conference on Machine Learning, pp. 387-395 (2014)
[72] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms (2017). arXiv preprint arXiv:1707.06347
[73] Schulman, J., Levine, S., Abbeel, P., Jordan, M., Moritz, P.: Trust region policy optimization. In: International Conference on Machine Learning, pp. 1889-1897 (2015)
[74] Haarnoja, T., Zhou, A., Abbeel, P., Levine, S.: Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor (2018). arXiv preprint arXiv:1801.01290
[75] Yang, Z., Zhang, K., Hong, M., Başar, T.: A finite sample analysis of the actor-critic algorithm. In: IEEE Conference on Decision and Control, pp. 2759-2764 (2018)
[76] Zhang, K., Koppel, A., Zhu, H., Başar, T.: Global convergence of policy gradient methods to (almost) locally optimal policies (2019). arXiv preprint arXiv:1906.08383 · Zbl 1451.93379
[77] Agarwal, A., Kakade, S.M., Lee, J.D., Mahajan, G.: Optimality and approximation with policy gradient methods in Markov decision processes (2019). arXiv preprint arXiv:1908.00261
[78] Liu, B., Cai, Q., Yang, Z., Wang, Z.: Neural proximal/trust region policy optimization attains globally optimal policy (2019). arXiv preprint arXiv:1906.10306
[79] Wang, L., Cai, Q., Yang, Z., Wang, Z.: Neural policy gradient methods: global optimality and rates of convergence (2019). arXiv preprint arXiv:1909.01150
[80] Chen, Y., Wang, M.: Stochastic primal-dual methods and sample complexity of reinforcement learning (2016). arXiv preprint arXiv:1612.02516
[81] Wang, M.: Primal-dual \(\pi\) learning: sample complexity and sublinear run time for ergodic Markov decision problems (2017). arXiv preprint arXiv:1710.06100
[82] Shapley, LS, Stochastic games, Proc. Natl. Acad. Sci., 39, 10, 1095-1100 (1953) · Zbl 0051.35805 · doi:10.1073/pnas.39.10.1953
[83] Littman, M.L.: Markov games as a framework for multi-agent reinforcement learning. In: International Conference on Machine Learning, pp. 157-163 (1994)
[84] Başar, T., Olsder, G.J.: Dynamic Noncooperative Game Theory, vol. 23. SIAM, Philadelphia (1999) · Zbl 0946.91001
[85] Filar, J., Vrieze, K.: Competitive Markov Decision Processes. Springer Science & Business Media, Berlin (2012) · Zbl 0934.91002
[86] Boutilier, C.: Planning, learning and coordination in multi-agent decision processes. In: Conference on Theoretical Aspects of Rationality and Knowledge, pp. 195-210 (1996)
[87] Lauer, M., Riedmiller, M.: An algorithm for distributed reinforcement learning in cooperative multi-agent systems. In: International Conference on Machine Learning (2000)
[88] Yoshikawa, T., Decomposition of dynamic team decision problems, IEEE Trans. Autom. Control, 23, 4, 627-632 (1978) · Zbl 0381.93007 · doi:10.1109/TAC.1978.1101791
[89] Ho, YC, Team decision theory and information structures, Proc. IEEE, 68, 6, 644-654 (1980) · doi:10.1109/PROC.1980.11718
[90] Wang, X., Sandholm, T.: Reinforcement learning to play an optimal Nash equilibrium in team Markov games. In: Advances in Neural Information Processing Systems, pp. 1603-1610 (2003)
[91] Mahajan, A.: Sequential decomposition of sequential dynamic teams: applications to real-time communication and networked control systems. Ph.D. thesis, University of Michigan (2008)
[92] González-Sánchez, D., Hernández-Lerma, O.: Discrete-Time Stochastic Control and Dynamic Potential Games: The Euler-Equation Approach. Springer Science & Business Media, Berlin (2013) · Zbl 1344.93001
[93] Valcarcel Macua, S., Zazo, J., Zazo, S.: Learning parametric closed-loop policies for Markov potential games. In: International Conference on Learning Representations (2018) · Zbl 1414.94724
[94] Kar, S.; Moura, JM; Poor, HV, QD-learning: a collaborative distributed strategy for multi-agent reinforcement learning through consensus + innovations, IEEE Trans. Signal Process., 61, 7, 1848-1862 (2013) · Zbl 1393.94293 · doi:10.1109/TSP.2013.2241057
[95] Doan, T., Maguluri, S., Romberg, J.: Finite-time analysis of distributed TD(0) with linear function approximation on multi-agent reinforcement learning. In: International Conference on Machine Learning, pp. 1626-1635 (2019)
[96] Wai, H.T., Yang, Z., Wang, Z., Hong, M.: Multi-agent reinforcement learning via double averaging primal-dual optimization. In: Advances in Neural Information Processing Systems, pp. 9649-9660 (2018)
[97] OpenAI: OpenAI Dota 2 1v1 bot. https://openai.com/the-international/ (2017)
[98] Jacobson, D., Optimal stochastic linear systems with exponential performance criteria and their relation to deterministic differential games, IEEE Trans. Autom. Control, 18, 2, 124-131 (1973) · Zbl 0274.93067 · doi:10.1109/TAC.1973.1100265
[99] Başar, T.; Bernhard, P., \(H_\infty\) Optimal Control and Related Minimax Design Problems: A Dynamic Game Approach (1995), Boston: Birkhäuser, Boston · Zbl 0835.93001
[100] Zhang, K., Hu, B., Başar, T.: Policy optimization for \(\cal{H}_2\) linear control with \(\cal{H}_{\infty }\) robustness guarantee: implicit regularization and global convergence (2019). arXiv preprint arXiv:1910.09496
[101] Hu, J., Wellman, M.P.: Nash Q-learning for general-sum stochastic games. J. Mach. Learn. Res. 4, 1039-1069 (2003) · Zbl 1094.68076
[102] Littman, M.L.: Friend-or-foe Q-learning in general-sum games. In: International Conference on Machine Learning, pp. 322-328 (2001)
[103] Lagoudakis, M.G., Parr, R.: Learning in zero-sum team Markov games using factored value functions. In: Advances in Neural Information Processing Systems, pp. 1659-1666 (2003)
[104] Bernstein, DS; Givan, R.; Immerman, N.; Zilberstein, S., The complexity of decentralized control of Markov decision processes, Math. Oper. Res., 27, 4, 819-840 (2002) · Zbl 1082.90593 · doi:10.1287/moor.27.4.819.297
[105] Osborne, M.J., Rubinstein, A.: A Course in Game Theory. MIT Press, Cambridge (1994) · Zbl 1194.91003
[106] Shoham, Y., Leyton-Brown, K.: Multiagent Systems: Algorithmic, Game-Theoretic, and Logical Foundations. Cambridge University Press, Cambridge (2008) · Zbl 1163.91006
[107] Koller, D.; Megiddo, N., The complexity of two-person zero-sum games in extensive form, Games Econ. Behav., 4, 4, 528-552 (1992) · Zbl 0758.90084 · doi:10.1016/0899-8256(92)90035-Q
[108] Kuhn, H., Extensive games and the problem of information, Contrib. Theory Games, 2, 193-216 (1953) · Zbl 0050.14303
[109] Zinkevich, M., Johanson, M., Bowling, M., Piccione, C.: Regret minimization in games with incomplete information. In: Advances in Neural Information Processing Systems, pp. 1729-1736 (2008)
[110] Heinrich, J., Lanctot, M., Silver, D.: Fictitious self-play in extensive-form games. In: International Conference on Machine Learning, pp. 805-813 (2015)
[111] Srinivasan, S., Lanctot, M., Zambaldi, V., Pérolat, J., Tuyls, K., Munos, R., Bowling, M.: Actor-critic policy optimization in partially observable multiagent environments. In: Advances in Neural Information Processing Systems, pp. 3422-3435 (2018)
[112] Omidshafiei, S., Hennes, D., Morrill, D., Munos, R., Perolat, J., Lanctot, M., Gruslys, A., Lespiau, J.B., Tuyls, K.: Neural replicator dynamics (2019). arXiv preprint arXiv:1906.00190
[113] Rubin, J., Watson, I.: Computer Poker: a review. Artif. Intell. 175(5-6), 958-987 (2011)
[114] Lanctot, M., Lockhart, E., Lespiau, J.B., Zambaldi, V., Upadhyay, S., Pérolat, J., Srinivasan, S., Timbers, F., Tuyls, K., Omidshafiei, S., et al.: OpenSpiel: a framework for reinforcement learning in games (2019). arXiv preprint arXiv:1908.09453
[115] Claus, C., Boutilier, C.: The dynamics of reinforcement learning in cooperative multiagent systems. In: AAAI Conference on Artificial Intelligence, pp. 746-752 (1998)
[116] Bowling, M., Veloso, M.: Rational and convergent learning in stochastic games. In: International Joint Conference on Artificial Intelligence, vol. 17, pp. 1021-1026 (2001)
[117] Kapetanakis, S., Kudenko, D.: Reinforcement learning of coordination in cooperative multi-agent systems. In: AAAI Conference on Artificial Intelligence, pp. 326-331 (2002)
[118] Conitzer, V.; Sandholm, T., Awesome: a general multiagent learning algorithm that converges in self-play and learns a best response against stationary opponents, Mach. Learn., 67, 1-2, 23-43 (2007) · Zbl 1471.91075 · doi:10.1007/s10994-006-0143-1
[119] Hansen, E.A., Bernstein, D.S., Zilberstein, S.: Dynamic programming for partially observable stochastic games. In: AAAI Conference on Artificial Intelligence, pp. 709-715 (2004)
[120] Amato, C., Chowdhary, G., Geramifard, A., Üre, N.K., Kochenderfer, M.J.: Decentralized control of partially observable Markov decision processes. In: IEEE Conference on Decision and Control, pp. 2398-2405 (2013)
[121] Amato, C., Oliehoek, F.A.: Scalable planning and learning for multiagent POMDPs. In: AAAI Conference on Artificial Intelligence (2015)
[122] Shoham, Y., Powers, R., Grenager, T.: Multi-agent reinforcement learning: a critical survey. Technical Report (2003) · Zbl 1168.68493
[123] Zinkevich, M., Greenwald, A., Littman, M.L.: Cyclic equilibria in Markov games. In: Advances in Neural Information Processing Systems, pp. 1641-1648 (2006)
[124] Bowling, M.; Veloso, M., Multiagent learning using a variable learning rate, Artif. Intell., 136, 2, 215-250 (2002) · Zbl 0995.68075 · doi:10.1016/S0004-3702(02)00121-2
[125] Bowling, M.: Convergence and no-regret in multiagent learning. In: Advances in Neural Information Processing Systems, pp. 209-216 (2005)
[126] Blum, A., Mansour, Y.: Learning, regret minimization, and equilibria. In: Algorithmic Game Theory, pp. 79-102 (2007) · Zbl 1143.91311
[127] Hart, S., Mas-Colell, A.: A reinforcement procedure leading to correlated equilibrium. In: Economics Essays, pp. 181-200. Springer, Berlin (2001) · Zbl 1023.91004
[128] Kasai, T., Tenmoto, H., Kamiya, A.: Learning of communication codes in multi-agent reinforcement learning problem. In: IEEE Conference on Soft Computing in Industrial Applications, pp. 1-6 (2008)
[129] Kim, D., Moon, S., Hostallero, D., Kang, W.J., Lee, T., Son, K., Yi, Y.: Learning to schedule communication in multi-agent reinforcement learning. In: International Conference on Learning Representations (2019)
[130] Chen, T., Zhang, K., Giannakis, G.B., Başar, T.: Communication-efficient distributed reinforcement learning (2018). arXiv preprint arXiv:1812.03239
[131] Lin, Y., Zhang, K., Yang, Z., Wang, Z., Başar, T., Sandhu, R., Liu, J.: A communication-efficient multi-agent actor-critic algorithm for distributed reinforcement learning. In: IEEE Conference on Decision and Control (2019)
[132] Ren, J., Haupt, J.: A communication efficient hierarchical distributed optimization algorithm for multi-agent reinforcement learning. In: Real-World Sequential Decision Making Workshop at International Conference on Machine Learning (2019)
[133] Kim, W., Cho, M., Sung, Y.: Message-dropout: an efficient training method for multi-agent deep reinforcement learning. In: AAAI Conference on Artificial Intelligence (2019)
[134] He, H., Boyd-Graber, J., Kwok, K., Daumé III, H.: Opponent modeling in deep reinforcement learning. In: International Conference on Machine Learning, pp. 1804-1813 (2016)
[135] Grover, A., Al-Shedivat, M., Gupta, J., Burda, Y., Edwards, H.: Learning policy representations in multiagent systems. In: International Conference on Machine Learning, pp. 1802-1811 (2018)
[136] Gao, C., Mueller, M., Hayward, R.: Adversarial policy gradient for alternating Markov games. In: Workshop at International Conference on Learning Representations (2018)
[137] Li, S., Wu, Y., Cui, X., Dong, H., Fang, F., Russell, S.: Robust multi-agent reinforcement learning via minimax deep deterministic policy gradient. In: AAAI Conference on Artificial Intelligence (2019)
[138] Zhang, X., Zhang, K., Miehling, E., Basar, T.: Non-cooperative inverse reinforcement learning. In: Advances in Neural Information Processing Systems, pp. 9482-9493 (2019)
[139] Tan, M.: Multi-agent reinforcement learning: Independent vs. cooperative agents. In: International Conference on Machine Learning, pp. 330-337 (1993)
[140] Matignon, L.; Laurent, GJ; Le Fort-Piat, N., Independent reinforcement learners in cooperative Markov games: a survey regarding coordination problems, Knowl. Eng. Rev., 27, 1, 1-31 (2012) · doi:10.1017/S0269888912000057
[141] Foerster, J., Nardelli, N., Farquhar, G., Torr, P., Kohli, P., Whiteson, S., et al.: Stabilising experience replay for deep multi-agent reinforcement learning. In: International Conference of Machine Learning, pp. 1146-1155 (2017)
[142] Tuyls, K.; Weiss, G., Multiagent learning: basics, challenges, and prospects, AI Mag., 33, 3, 41 (2012)
[143] Guestrin, C., Lagoudakis, M., Parr, R.: Coordinated reinforcement learning. In: International Conference on Machine Learning, pp. 227-234 (2002)
[144] Guestrin, C., Koller, D., Parr, R.: Multiagent planning with factored MDPs. In: Advances in Neural Information Processing Systems, pp. 1523-1530 (2002)
[145] Kok, J.R., Vlassis, N.: Sparse cooperative Q-learning. In: International Conference on Machine learning, pp. 61-69 (2004)
[146] Sunehag, P., Lever, G., Gruslys, A., Czarnecki, W.M., Zambaldi, V., Jaderberg, M., Lanctot, M., Sonnerat, N., Leibo, J.Z., Tuyls, K., et al.: Value-decomposition networks for cooperative multi-agent learning based on team reward. In: International Conference on Autonomous Agents and Multi-Agent Systems, pp. 2085-2087 (2018)
[147] Rashid, T., Samvelyan, M., De Witt, C.S., Farquhar, G., Foerster, J., Whiteson, S.: QMIX: monotonic value function factorisation for deep multi-agent reinforcement learning. In: International Conference on Machine Learning, pp. 681-689 (2018) · Zbl 1527.68192
[148] Qu, G., Li, N.: Exploiting fast decaying and locality in multi-agent MDP with tree dependence structure. In: IEEE Conference on Decision and Control (2019)
[149] Mahajan, A., Optimal decentralized control of coupled subsystems with control sharing, IEEE Trans. Autom. Control, 58, 9, 2377-2382 (2013) · Zbl 1369.93721 · doi:10.1109/TAC.2013.2251807
[150] Oliehoek, F.A., Amato, C.: Dec-POMDPs as non-observable MDPs. IAS Technical Report (IAS-UVA-14-01) (2014)
[151] Foerster, J.N., Farquhar, G., Afouras, T., Nardelli, N., Whiteson, S.: Counterfactual multi-agent policy gradients. In: AAAI Conference on Artificial Intelligence (2018)
[152] Dibangoye, J., Buffet, O.: Learning to act in decentralized partially observable MDPs. In: International Conference on Machine Learning, pp. 1233-1242 (2018)
[153] Kraemer, L.; Banerjee, B., Multi-agent reinforcement learning as a rehearsal for decentralized planning, Neurocomputing, 190, 82-94 (2016) · doi:10.1016/j.neucom.2016.01.031
[154] Macua, SV; Chen, J.; Zazo, S.; Sayed, AH, Distributed policy evaluation under multiple behavior strategies, IEEE Trans. Autom. Control, 60, 5, 1260-1274 (2015) · Zbl 1360.68714 · doi:10.1109/TAC.2014.2368731
[155] Macua, S.V., Tukiainen, A., Hernández, D.G.O., Baldazo, D., de Cote, E.M., Zazo, S.: Diff-dac: Distributed actor-critic for average multitask deep reinforcement learning (2017). arXiv preprint arXiv:1710.10363
[156] Lee, D., Yoon, H., Hovakimyan, N.: Primal-dual algorithm for distributed reinforcement learning: distributed GTD. In: IEEE Conference on Decision and Control, pp. 1967-1972 (2018)
[157] Doan, T.T., Maguluri, S.T., Romberg, J.: Finite-time performance of distributed temporal difference learning with linear function approximation (2019). arXiv preprint arXiv:1907.12530 · Zbl 1483.68294
[158] Suttle, W., Yang, Z., Zhang, K., Wang, Z., Başar, T., Liu, J.: A multi-agent off-policy actor-critic algorithm for distributed reinforcement learning (2019). arXiv preprint arXiv:1903.06372
[159] Littman, ML, Value-function reinforcement learning in Markov games, Cogn. Syst. Res., 2, 1, 55-66 (2001) · doi:10.1016/S1389-0417(01)00015-8
[160] Young, H.P.: The evolution of conventions. Econometrica 61(1), 57-84 (1993) · Zbl 0773.90101
[161] Son, K., Kim, D., Kang, W.J., Hostallero, D.E., Yi, Y.: QTRAN: Learning to factorize with transformation for cooperative multi-agent reinforcement learning. In: International Conference on Machine Learning, pp. 5887-5896 (2019)
[162] Perolat, J., Piot, B., Pietquin, O.: Actor-critic fictitious play in simultaneous move multistage games. In: International Conference on Artificial Intelligence and Statistics (2018)
[163] Monderer, D.; Shapley, LS, Potential games, Games Econ. Behav., 14, 1, 124-143 (1996) · Zbl 0862.90137 · doi:10.1006/game.1996.0044
[164] Başar, T., Zaccour, G.: Handbook of Dynamic Game Theory. Springer, Berlin (2018) · Zbl 1394.91001
[165] Huang, M., Caines, P.E., Malhamé, R.P.: Individual and mass behaviour in large population stochastic wireless power control problems: centralized and Nash equilibrium solutions. In: IEEE Conference on Decision and Control, pp. 98-103 (2003)
[166] Huang, M.; Malhamé, RP; Caines, PE, Large population stochastic dynamic games: closed-loop Mckean-Vlasov systems and the Nash certainty equivalence principle, Commun. Inf. Syst., 6, 3, 221-252 (2006) · Zbl 1136.91349
[167] Lasry, JM; Lions, PL, Mean field games, Jpn. J. Math., 2, 1, 229-260 (2007) · Zbl 1156.91321 · doi:10.1007/s11537-007-0657-8
[168] Bensoussan, A., Frehse, J., Yam, P., et al.: Mean Field Games and Mean Field Type Control Theory, vol. 101. Springer, Berlin (2013) · Zbl 1287.93002
[169] Tembine, H.; Zhu, Q.; Başar, T., Risk-sensitive mean-field games, IEEE Trans. Autom. Control, 59, 4, 835-850 (2013) · Zbl 1360.49032 · doi:10.1109/TAC.2013.2289711
[170] Arabneydi, J., Mahajan, A.: Team optimal control of coupled subsystems with mean-field sharing. In: IEEE Conference on Decision and Control, pp. 1669-1674 (2014)
[171] Arabneydi, J.: New concepts in team theory: Mean field teams and reinforcement learning. Ph.D. thesis, McGill University (2017)
[172] Yang, Y., Luo, R., Li, M., Zhou, M., Zhang, W., Wang, J.: Mean field multi-agent reinforcement learning. In: International Conference on Machine Learning, pp. 5571-5580 (2018)
[173] Witsenhausen, HS, Separation of estimation and control for discrete time systems, Proc. IEEE, 59, 11, 1557-1566 (1971) · doi:10.1109/PROC.1971.8488
[174] Yüksel, S., Başar, T.: Stochastic Networked Control Systems: Stabilization and Optimization Under Information Constraints. Springer Science & Business Media, Berlin (2013) · Zbl 1280.93003
[175] Subramanian, J., Seraj, R., Mahajan, A.: Reinforcement learning for mean-field teams. In: Workshop on Adaptive and Learning Agents at International Conference on Autonomous Agents and Multi-Agent Systems (2018)
[176] Arabneydi, J., Mahajan, A.: Linear quadratic mean field teams: optimal and approximately optimal decentralized solutions (2016). arXiv preprint arXiv:1609.00056
[177] Carmona, R., Laurière, M., Tan, Z.: Linear-quadratic mean-field reinforcement learning: convergence of policy gradient methods (2019). arXiv preprint arXiv:1910.04295
[178] Carmona, R., Laurière, M., Tan, Z.: Model-free mean-field reinforcement learning: mean-field MDP and mean-field Q-learning (2019). arXiv preprint arXiv:1910.12802
[179] Rabbat, M., Nowak, R.: Distributed optimization in sensor networks. In: International Symposium on Information Processing in Sensor Networks, pp. 20-27 (2004)
[180] Dall’Anese, E.; Zhu, H.; Giannakis, GB, Distributed optimal power flow for smart microgrids, IEEE Trans. Smart Grid, 4, 3, 1464-1475 (2013) · doi:10.1109/TSG.2013.2248175
[181] Zhang, K.; Shi, W.; Zhu, H.; Dall’Anese, E.; Başar, T., Dynamic power distribution system management with a locally connected communication network, IEEE J. Sel. Top. Signal Process., 12, 4, 673-687 (2018) · doi:10.1109/JSTSP.2018.2837338
[182] Zhang, K.; Lu, L.; Lei, C.; Zhu, H.; Ouyang, Y., Dynamic operations and pricing of electric unmanned aerial vehicle systems and power networks, Transp. Res. Part C: Emerg. Technol., 92, 472-485 (2018) · doi:10.1016/j.trc.2018.05.011
[183] Corke, P., Peterson, R., Rus, D.: Networked robots: flying robot navigation using a sensor net. Robot. Res. 234-243 (2005)
[184] Zhang, K., Liu, Y., Liu, J., Liu, M., Başar, T.: Distributed learning of average belief over networks using sequential observations. Automatica (2019) · Zbl 1436.93012
[185] Nedic, A.; Ozdaglar, A., Distributed subgradient methods for multi-agent optimization, IEEE Trans. Autom. Control, 54, 1, 48-61 (2009) · Zbl 1367.90086 · doi:10.1109/TAC.2008.2009515
[186] Agarwal, A., Duchi, J.C.: Distributed delayed stochastic optimization. In: Advances in Neural Information Processing Systems, pp. 873-881 (2011)
[187] Jakovetic, D.; Xavier, J.; Moura, JM, Cooperative convex optimization in networked systems: augmented Lagrangian algorithms with directed gossip communication, IEEE Trans. Signal Process., 59, 8, 3889-3902 (2011) · Zbl 1392.94018 · doi:10.1109/TSP.2011.2146776
[188] Tu, SY; Sayed, AH, Diffusion strategies outperform consensus strategies for distributed estimation over adaptive networks, IEEE Trans. Signal Process., 60, 12, 6217-6234 (2012) · Zbl 1393.94465 · doi:10.1109/TSP.2012.2217338
[189] Varshavskaya, P., Kaelbling, L.P., Rus, D.: Efficient distributed reinforcement learning through agreement. In: Distributed Autonomous Robotic Systems, pp. 367-378 (2009)
[190] Ciosek, K., Whiteson, S.: Expected policy gradients for reinforcement learning (2018). arXiv preprint arXiv:1801.03326 · Zbl 1498.68229
[191] Sutton, RS; Mahmood, AR; White, M., An emphatic approach to the problem of off-policy temporal-difference learning, J. Mach. Learn. Res., 17, 1, 2603-2631 (2016) · Zbl 1360.68712
[192] Yu, H.: On convergence of emphatic temporal-difference learning. In: Conference on Learning Theory, pp. 1724-1751 (2015)
[193] Zhang, Y., Zavlanos, M.M.: Distributed off-policy actor-critic reinforcement learning with policy consensus (2019). arXiv preprint arXiv:1903.09255
[194] Pennesi, P.; Paschalidis, IC, A distributed actor-critic algorithm and applications to mobile sensor network coordination problems, IEEE Trans. Autom. Control, 55, 2, 492-497 (2010) · Zbl 1368.90026 · doi:10.1109/TAC.2009.2037462
[195] Lange, S., Gabel, T., Riedmiller, M.: Batch reinforcement learning. In: Reinforcement Learning, pp. 45-73. Springer, Berlin (2012)
[196] Riedmiller, M.: Neural fitted Q iteration-first experiences with a data efficient neural reinforcement learning method. In: European Conference on Machine Learning, pp. 317-328 (2005)
[197] Antos, A., Szepesvári, C., Munos, R.: Fitted Q-iteration in continuous action-space MDPs. In: Advances in Neural Information Processing Systems, pp. 9-16 (2008)
[198] Hong, M.; Chang, TH, Stochastic proximal gradient consensus over random networks, IEEE Trans. Signal Process., 65, 11, 2933-2948 (2017) · Zbl 1414.94247 · doi:10.1109/TSP.2017.2673815
[199] Nedic, A.; Olshevsky, A.; Shi, W., Achieving geometric convergence for distributed optimization over time-varying graphs, SIAM J. Optim., 27, 4, 2597-2633 (2017) · Zbl 1387.90189 · doi:10.1137/16M1084316
[200] Munos, R., Performance bounds in \(\ell_p\)-norm for approximate value iteration, SIAM J. Control Optim., 46, 2, 541-561 (2007) · Zbl 1356.90159 · doi:10.1137/040614384
[201] Munos, R.; Szepesvári, C., Finite-time bounds for fitted value iteration, J. Mach. Learn. Res., 9, May, 815-857 (2008) · Zbl 1225.68203
[202] Antos, A.; Szepesvári, C.; Munos, R., Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path, Mach. Learn., 71, 1, 89-129 (2008) · Zbl 1143.68516 · doi:10.1007/s10994-007-5038-2
[203] Farahmand, A.M., Szepesvári, C., Munos, R.: Error propagation for approximate policy and value iteration. In: Advances in Neural Information Processing Systems, pp. 568-576 (2010)
[204] Cassano, L., Yuan, K., Sayed, A.H.: Multi-agent fully decentralized off-policy learning with linear convergence rates (2018). arXiv preprint arXiv:1810.07792
[205] Qu, G.; Li, N., Harnessing smoothness to accelerate distributed optimization, IEEE Trans. Control Netw. Syst., 5, 3, 1245-1260 (2017) · Zbl 1515.93111 · doi:10.1109/TCNS.2017.2698261
[206] Schmidt, M.; Le Roux, N.; Bach, F., Minimizing finite sums with the stochastic average gradient, Math. Program., 162, 1-2, 83-112 (2017) · Zbl 1358.90073 · doi:10.1007/s10107-016-1030-6
[207] Ying, B., Yuan, K., Sayed, A.H.: Convergence of variance-reduced learning under random reshuffling. In: IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 2286-2290 (2018)
[208] Singh, SP; Sutton, RS, Reinforcement learning with replacing eligibility traces, Mach. Learn., 22, 1-3, 123-158 (1996) · Zbl 1099.68700
[209] Bhandari, J., Russo, D., Singal, R.: A finite time analysis of temporal difference learning with linear function approximation. In: Conference On Learning Theory, pp. 1691-1692 (2018)
[210] Srikant, R., Ying, L.: Finite-time error bounds for linear stochastic approximation and TD learning. In: Conference on Learning Theory, pp. 2803-2830 (2019)
[211] Stanković, M.S., Stanković, S.S.: Multi-agent temporal-difference learning with linear function approximation: weak convergence under time-varying network topologies. In: IEEE American Control Conference, pp. 167-172 (2016)
[212] Stanković, MS; Ilić, N.; Stanković, SS, Distributed stochastic approximation: weak convergence and network design, IEEE Trans. Autom. Control, 61, 12, 4069-4074 (2016) · Zbl 1359.90032 · doi:10.1109/TAC.2016.2545098
[213] Zhang, H.; Jiang, H.; Luo, Y.; Xiao, G., Data-driven optimal consensus control for discrete-time multi-agent systems with unknown dynamics using reinforcement learning method, IEEE Trans. Ind. Electron., 64, 5, 4091-4100 (2016) · doi:10.1109/TIE.2016.2542134
[214] Zhang, Q., Zhao, D., Lewis, F.L.: Model-free reinforcement learning for fully cooperative multi-agent graphical games. In: International Joint Conference on Neural Networks, pp. 1-6 (2018)
[215] Bernstein, DS; Amato, C.; Hansen, EA; Zilberstein, S., Policy iteration for decentralized control of Markov decision processes, J. Artif. Intell. Res., 34, 89-132 (2009) · Zbl 1182.68216 · doi:10.1613/jair.2667
[216] Amato, C.; Bernstein, DS; Zilberstein, S., Optimizing fixed-size stochastic controllers for POMDPs and decentralized POMDPs, Auton. Agents Multi-Agent Syst., 21, 3, 293-320 (2010) · doi:10.1007/s10458-009-9103-z
[217] Liu, M., Amato, C., Liao, X., Carin, L., How, J.P.: Stick-breaking policy learning in Dec-POMDPs. In: International Joint Conference on Artificial Intelligence (2015)
[218] Dibangoye, JS; Amato, C.; Buffet, O.; Charpillet, F., Optimally solving Dec-POMDPs as continuous-state MDPs, J. Artif. Intell. Res., 55, 443-497 (2016) · Zbl 1352.68220 · doi:10.1613/jair.4623
[219] Wu, F., Zilberstein, S., Chen, X.: Rollout sampling policy iteration for decentralized POMDPs. In: Conference on Uncertainty in Artificial Intelligence (2010)
[220] Wu, F., Zilberstein, S., Jennings, N.R.: Monte-Carlo expectation maximization for decentralized POMDPs. In: International Joint Conference on Artificial Intelligence (2013)
[221] Best, G., Cliff, O.M., Patten, T., Mettu, R.R., Fitch, R.: Dec-MCTS: decentralized planning for multi-robot active perception. Int. J. Robot. Res. 1-22 (2018)
[222] Amato, C., Zilberstein, S.: Achieving goals in decentralized POMDPs. In: International Conference on Autonomous Agents and Multi-Agent Systems, pp. 593-600 (2009)
[223] Banerjee, B., Lyle, J., Kraemer, L., Yellamraju, R.: Sample bounded distributed reinforcement learning for decentralized POMDPs. In: AAAI Conference on Artificial Intelligence (2012)
[224] Nayyar, A.; Mahajan, A.; Teneketzis, D., Decentralized stochastic control with partial history sharing: a common information approach, IEEE Trans. Autom. Control, 58, 7, 1644-1658 (2013) · Zbl 1369.90187 · doi:10.1109/TAC.2013.2239000
[225] Arabneydi, J., Mahajan, A.: Reinforcement learning in decentralized stochastic control systems with partial history sharing. In: IEEE American Control Conference, pp. 5449-5456 (2015)
[226] Papadimitriou, C.H.: On inefficient proofs of existence and complexity classes. In: Annals of Discrete Mathematics, vol. 51, pp. 245-250. Elsevier (1992) · Zbl 0798.68058
[227] Daskalakis, C.; Goldberg, PW; Papadimitriou, CH, The complexity of computing a Nash equilibrium, SIAM J. Comput., 39, 1, 195-259 (2009) · Zbl 1185.91019 · doi:10.1137/070699652
[228] Von Neumann, J., Morgenstern, O., Kuhn, H.W.: Theory of Games and Economic Behavior (commemorative edition). Princeton University Press, Princeton (2007) · Zbl 1112.91002
[229] Vanderbei, R.J., et al.: Linear Programming. Springer, Berlin (2015)
[230] Hoffman, AJ; Karp, RM, On nonterminating stochastic games, Manag. Sci., 12, 5, 359-370 (1966) · Zbl 0136.14303 · doi:10.1287/mnsc.12.5.359
[231] Van Der Wal, J., Discounted Markov games: generalized policy iteration method, J. Optim. Theory Appl., 25, 1, 125-138 (1978) · Zbl 0352.90071 · doi:10.1007/BF00933260
[232] Rao, SS; Chandrasekaran, R.; Nair, K., Algorithms for discounted stochastic games, J. Optim. Theory Appl., 11, 6, 627-637 (1973) · Zbl 0245.93024 · doi:10.1007/BF00935562
[233] Patek, S.D.: Stochastic and shortest path games: theory and algorithms. Ph.D. thesis, Massachusetts Institute of Technology (1997)
[234] Hansen, TD; Miltersen, PB; Zwick, U., Strategy iteration is strongly polynomial for 2-player turn-based stochastic games with a constant discount factor, J. ACM, 60, 1, 1 (2013) · Zbl 1281.91019 · doi:10.1145/2432622.2432623
[235] Lagoudakis, M.G., Parr, R.: Value function approximation in zero-sum Markov games. In: Conference on Uncertainty in Artificial Intelligence, pp. 283-292 (2002)
[236] Zou, S., Xu, T., Liang, Y.: Finite-sample analysis for SARSA with linear function approximation (2019). arXiv preprint arXiv:1902.02234
[237] Sutton, R.S., Barto, A.G.: A temporal-difference model of classical conditioning. In: Proceedings of the Annual Conference of the Cognitive Science Society, pp. 355-378 (1987)
[238] Al-Tamimi, A.; Abu-Khalaf, M.; Lewis, FL, Adaptive critic designs for discrete-time zero-sum games with application to \(\cal{H}_\infty\) control, IEEE Trans. Syst. Man Cybern. Part B, 37, 1, 240-247 (2007) · Zbl 1137.93321 · doi:10.1109/TSMCB.2006.880135
[239] Al-Tamimi, A.; Lewis, FL; Abu-Khalaf, M., Model-free Q-learning designs for linear discrete-time zero-sum games with application to \(\cal{H}_\infty\) control, Automatica, 43, 3, 473-481 (2007) · Zbl 1137.93321 · doi:10.1016/j.automatica.2006.09.019
[240] Farahmand, A.M., Ghavamzadeh, M., Szepesvári, C., Mannor, S.: Regularized policy iteration with nonparametric function spaces. J. Mach. Learn. Res. 17(1), 4809-4874 (2016) · Zbl 1392.68345
[241] Yang, Z., Xie, Y., Wang, Z.: A theoretical analysis of deep Q-learning (2019). arXiv preprint arXiv:1901.00137
[242] Jia, Z., Yang, L.F., Wang, M.: Feature-based Q-learning for two-player stochastic games (2019). arXiv preprint arXiv:1906.00423
[243] Sidford, A., Wang, M., Wu, X., Yang, L., Ye, Y.: Near-optimal time and sample complexities for solving Markov decision processes with a generative model. In: Advances in Neural Information Processing Systems, pp. 5186-5196 (2018)
[244] Wei, C.Y., Hong, Y.T., Lu, C.J.: Online reinforcement learning in stochastic games. In: Advances in Neural Information Processing Systems, pp. 4987-4997 (2017)
[245] Auer, P., Ortner, R.: Logarithmic online regret bounds for undiscounted reinforcement learning. In: Advances in Neural Information Processing Systems, pp. 49-56 (2007)
[246] Jaksch, T., Ortner, R., Auer, P.: Near-optimal regret bounds for reinforcement learning. J. Mach. Learn. Res. 11, 1563-1600 (2010) · Zbl 1242.68229
[247] Koller, D., Megiddo, N., von Stengel, B.: Fast algorithms for finding randomized strategies in game trees. In: ACM Symposium on Theory of Computing, pp. 750-759 (1994) · Zbl 1345.68258
[248] Von Stengel, B., Efficient computation of behavior strategies, Games Econ. Behav., 14, 2, 220-246 (1996) · Zbl 0867.90131 · doi:10.1006/game.1996.0050
[249] Koller, D.; Megiddo, N.; Von Stengel, B., Efficient computation of equilibria for extensive two-person games, Games Econ. Behav., 14, 2, 247-259 (1996) · Zbl 0859.90127 · doi:10.1006/game.1996.0051
[250] Von Stengel, B., Computing equilibria for two-person games, Handbook of Game Theory with Economic Applications, 3, 1723-1759 (2002) · doi:10.1016/S1574-0005(02)03008-4
[251] Parr, R., Russell, S.: Approximating optimal policies for partially observable stochastic domains. In: International Joint Conference on Artificial Intelligence, pp. 1088-1094 (1995)
[252] Rodriguez, A.C., Parr, R., Koller, D.: Reinforcement learning using approximate belief states. In: Advances in Neural Information Processing Systems, pp. 1036-1042 (2000)
[253] Hauskrecht, M., Value-function approximations for partially observable Markov decision processes, J. Artif. Intell. Res., 13, 33-94 (2000) · Zbl 0946.68131 · doi:10.1613/jair.678
[254] Buter, B.J.: Dynamic programming for extensive form games with imperfect information. Ph.D. thesis, Universiteit van Amsterdam (2012)
[255] Cowling, PI; Powley, EJ; Whitehouse, D., Information set Monte Carlo tree search, IEEE Trans. Comput. Intell. AI Games, 4, 2, 120-143 (2012) · doi:10.1109/TCIAIG.2012.2200894
[256] Teraoka, K.; Hatano, K.; Takimoto, E., Efficient sampling method for Monte Carlo tree search problem, IEICE Trans. Inf. Syst., 97, 3, 392-398 (2014) · doi:10.1587/transinf.E97.D.392
[257] Whitehouse, D.: Monte Carlo tree search for games with hidden information and uncertainty. Ph.D. thesis, University of York (2014)
[258] Kaufmann, E., Koolen, W.M.: Monte-Carlo tree search by best arm identification. In: Advances in Neural Information Processing Systems, pp. 4897-4906 (2017)
[259] Hannan, J., Approximation to Bayes risk in repeated play, Contrib. Theory Games, 3, 97-139 (1957) · Zbl 0078.32804
[260] Brown, GW, Iterative solution of games by fictitious play, Act. Anal. Prod. Allo., 13, 1, 374-376 (1951) · Zbl 0045.09902
[261] Robinson, J.: An iterative method of solving a game. Ann. Math. 296-301 (1951) · Zbl 0045.08203
[262] Benaïm, M.; Hofbauer, J.; Sorin, S., Stochastic approximations and differential inclusions, SIAM J. Control Optim., 44, 1, 328-348 (2005) · Zbl 1087.62091 · doi:10.1137/S0363012904439301
[263] Hart, S.; Mas-Colell, A., A general class of adaptive strategies, J. Econ. Theory, 98, 1, 26-54 (2001) · Zbl 0994.91007 · doi:10.1006/jeth.2000.2746
[264] Monderer, D.; Samet, D.; Sela, A., Belief affirming in learning processes, J. Econ. Theory, 73, 2, 438-452 (1997) · Zbl 0886.90192 · doi:10.1006/jeth.1996.2245
[265] Viossat, Y.; Zapechelnyuk, A., No-regret dynamics and fictitious play, J. Econ. Theory, 148, 2, 825-842 (2013) · Zbl 1275.91019 · doi:10.1016/j.jet.2012.07.003
[266] Kushner, HJ; Yin, GG, Stochastic Approximation and Recursive Algorithms and Applications (2003), New York: Springer, New York · Zbl 1026.62084
[267] Fudenberg, D.; Levine, DK, Consistency and cautious fictitious play, J. Econ. Dyn. Control, 19, 5-7, 1065-1089 (1995) · Zbl 0900.90423 · doi:10.1016/0165-1889(94)00819-4
[268] Hofbauer, J.; Sandholm, WH, On the global convergence of stochastic fictitious play, Econometrica, 70, 6, 2265-2294 (2002) · Zbl 1141.91336 · doi:10.1111/1468-0262.00376
[269] Leslie, DS; Collins, EJ, Generalised weakened fictitious play, Games Econ. Behav., 56, 2, 285-298 (2006) · Zbl 1177.91044 · doi:10.1016/j.geb.2005.08.005
[270] Benaïm, M.; Faure, M., Consistency of vanishingly smooth fictitious play, Math. Oper. Res., 38, 3, 437-450 (2013) · Zbl 1297.91027 · doi:10.1287/moor.1120.0568
[271] Li, Z.; Tewari, A., Sampled fictitious play is Hannan consistent, Games Econ. Behav., 109, 401-412 (2018) · Zbl 1390.91043 · doi:10.1016/j.geb.2018.01.005
[272] Ernst, D., Geurts, P., Wehenkel, L.: Tree-based batch mode reinforcement learning. J. Mach. Learn. Res. 6(Apr), 503-556 (2005) · Zbl 1222.68193
[273] Heinrich, J., Silver, D.: Self-play Monte-Carlo tree search in computer Poker. In: Workshops at AAAI Conference on Artificial Intelligence (2014)
[274] Browne, CB; Powley, E.; Whitehouse, D.; Lucas, SM; Cowling, PI; Rohlfshagen, P.; Tavener, S.; Perez, D.; Samothrakis, S.; Colton, S., A survey of Monte Carlo tree search methods, IEEE Trans. Comput. Intell. AI Games, 4, 1, 1-43 (2012) · doi:10.1109/TCIAIG.2012.2186810
[275] Borkar, V.S.: Stochastic Approximation: A Dynamical Systems Viewpoint. Cambridge University Press, Cambridge (2008) · Zbl 1159.60002
[276] Cesa-Bianchi, N., Lugosi, G.: Prediction, Learning, and Games. Cambridge University Press, Cambridge (2006) · Zbl 1114.91001
[277] Auer, P.; Cesa-Bianchi, N.; Freund, Y.; Schapire, RE, The nonstochastic multiarmed bandit problem, SIAM J. Comput., 32, 1, 48-77 (2002) · Zbl 1029.68087 · doi:10.1137/S0097539701398375
[278] Vovk, V.G.: Aggregating strategies. In: Proceedings of Computational Learning Theory (1990)
[279] Littlestone, N.; Warmuth, MK, The weighted majority algorithm, Inf. Comput., 108, 2, 212-261 (1994) · Zbl 0804.68121 · doi:10.1006/inco.1994.1009
[280] Freund, Y.; Schapire, RE, Adaptive game playing using multiplicative weights, Games Econ. Behav., 29, 1-2, 79-103 (1999) · Zbl 0964.91007 · doi:10.1006/game.1999.0738
[281] Hart, S.; Mas-Colell, A., A simple adaptive procedure leading to correlated equilibrium, Econometrica, 68, 5, 1127-1150 (2000) · Zbl 1020.91003 · doi:10.1111/1468-0262.00153
[282] Lanctot, M., Waugh, K., Zinkevich, M., Bowling, M.: Monte Carlo sampling for regret minimization in extensive games. In: Advances in Neural Information Processing Systems, pp. 1078-1086 (2009)
[283] Burch, N., Lanctot, M., Szafron, D., Gibson, R.G.: Efficient Monte Carlo counterfactual regret minimization in games with many player actions. In: Advances in Neural Information Processing Systems, pp. 1880-1888 (2012)
[284] Gibson, R., Lanctot, M., Burch, N., Szafron, D., Bowling, M.: Generalized sampling and variance in counterfactual regret minimization. In: AAAI Conference on Artificial Intelligence (2012)
[285] Johanson, M., Bard, N., Lanctot, M., Gibson, R., Bowling, M.: Efficient Nash equilibrium approximation through Monte Carlo counterfactual regret minimization. In: International Conference on Autonomous Agents and Multi-Agent Systems, pp. 837-846 (2012)
[286] Lisỳ, V., Lanctot, M., Bowling, M.: Online Monte Carlo counterfactual regret minimization for search in imperfect information games. In: International Conference on Autonomous Agents and Multi-Agent Systems, pp. 27-36 (2015)
[287] Schmid, M., Burch, N., Lanctot, M., Moravcik, M., Kadlec, R., Bowling, M.: Variance reduction in Monte Carlo counterfactual regret minimization (VR-MCCFR) for extensive form games using baselines. In: AAAI Conference on Artificial Intelligence, vol. 33, pp. 2157-2164 (2019)
[288] Waugh, K., Morrill, D., Bagnell, J.A., Bowling, M.: Solving games with functional regret estimation. In: AAAI Conference on Artificial Intelligence (2015)
[289] Morrill, D.: Using regret estimation to solve games compactly. Ph.D. thesis, University of Alberta (2016)
[290] Brown, N., Lerer, A., Gross, S., Sandholm, T.: Deep counterfactual regret minimization. In: International Conference on Machine Learning, pp. 793-802 (2019)
[291] Brown, N., Sandholm, T.: Regret-based pruning in extensive-form games. In: Advances in Neural Information Processing Systems, pp. 1972-1980 (2015)
[292] Brown, N., Kroer, C., Sandholm, T.: Dynamic thresholding and pruning for regret minimization. In: AAAI Conference on Artificial Intelligence (2017)
[293] Brown, N., Sandholm, T.: Reduced space and faster convergence in imperfect-information games via pruning. In: International Conference on Machine Learning, pp. 596-604 (2017)
[294] Tammelin, O.: Solving large imperfect information games using CFR+ (2014). arXiv preprint arXiv:1407.5042
[295] Tammelin, O., Burch, N., Johanson, M., Bowling, M.: Solving heads-up limit Texas Hold’em. In: International Joint Conference on Artificial Intelligence (2015)
[296] Burch, N.; Moravcik, M.; Schmid, M., Revisiting CFR+ and alternating updates, J. Artif. Intell. Res., 64, 429-443 (2019) · Zbl 1477.68558 · doi:10.1613/jair.1.11370
[297] Zhou, Y., Ren, T., Li, J., Yan, D., Zhu, J.: Lazy-CFR: a fast regret minimization algorithm for extensive games with imperfect information (2018). arXiv preprint arXiv:1810.04433
[298] Zinkevich, M.: Online convex programming and generalized infinitesimal gradient ascent. In: International Conference on Machine Learning, pp. 928-936 (2003)
[299] Lockhart, E., Lanctot, M., Pérolat, J., Lespiau, J.B., Morrill, D., Timbers, F., Tuyls, K.: Computing approximate equilibria in sequential adversarial games by exploitability descent (2019). arXiv preprint arXiv:1903.05614
[300] Johanson, M., Bard, N., Burch, N., Bowling, M.: Finding optimal abstract strategies in extensive-form games. In: AAAI Conference on Artificial Intelligence, pp. 1371-1379 (2012)
[301] Shafiei, M., Sturtevant, N., Schaeffer, J.: Comparing UCT versus CFR in simultaneous games (2009)
[302] Lanctot, M., Lisỳ, V., Winands, M.H.: Monte Carlo tree search in simultaneous move games with applications to Goofspiel. In: Workshop on Computer Games, pp. 28-43 (2013)
[303] Lisỳ, V., Kovarik, V., Lanctot, M., Bošanskỳ, B.: Convergence of Monte Carlo tree search in simultaneous move games. In: Advances in Neural Information Processing Systems, pp. 2112-2120 (2013)
[304] Tak, M.J., Lanctot, M., Winands, M.H.: Monte Carlo tree search variants for simultaneous move games. In: IEEE Conference on Computational Intelligence and Games, pp. 1-8 (2014)
[305] Kovařík, V., Lisý, V.: Analysis of Hannan consistent selection for Monte Carlo tree search in simultaneous move games (2018). arXiv preprint arXiv:1804.09045 · Zbl 1440.68219
[306] Mazumdar, E.V., Jordan, M.I., Sastry, S.S.: On finding local Nash equilibria (and only local Nash equilibria) in zero-sum games (2019). arXiv preprint arXiv:1901.00838
[307] Bu, J., Ratliff, L.J., Mesbahi, M.: Global convergence of policy gradient for sequential zero-sum linear quadratic dynamic games (2019). arXiv preprint arXiv:1911.04672
[308] Mescheder, L., Nowozin, S., Geiger, A.: The numerics of GANs. In: Advances in Neural Information Processing Systems, pp. 1825-1835 (2017)
[309] Adolphs, L., Daneshmand, H., Lucchi, A., Hofmann, T.: Local saddle point optimization: a curvature exploitation approach (2018). arXiv preprint arXiv:1805.05751
[310] Daskalakis, C., Panageas, I.: The limit points of (optimistic) gradient descent in min-max optimization. In: Advances in Neural Information Processing Systems, pp. 9236-9246 (2018)
[311] Mertikopoulos, P., Zenati, H., Lecouat, B., Foo, C.S., Chandrasekhar, V., Piliouras, G.: Optimistic mirror descent in saddle-point problems: going the extra (gradient) mile. In: International Conference on Learning Representations (2019)
[312] Fiez, T., Chasnov, B., Ratliff, L.J.: Convergence of learning dynamics in Stackelberg games (2019). arXiv preprint arXiv:1906.01217
[313] Balduzzi, D., Racaniere, S., Martens, J., Foerster, J., Tuyls, K., Graepel, T.: The mechanics of n-player differentiable games. In: International Conference on Machine Learning, pp. 363-372 (2018) · Zbl 1489.91032
[314] Sanjabi, M., Razaviyayn, M., Lee, J.D.: Solving non-convex non-concave min-max games under Polyak-Łojasiewicz condition (2018). arXiv preprint arXiv:1812.02878
[315] Nouiehed, M., Sanjabi, M., Lee, J.D., Razaviyayn, M.: Solving a class of non-convex min-max games using iterative first order methods (2019). arXiv preprint arXiv:1902.08297
[316] Mazumdar, E., Ratliff, L.J., Jordan, M.I., Sastry, S.S.: Policy-gradient algorithms have no guarantees of convergence in continuous action and state multi-agent settings (2019). arXiv preprint arXiv:1907.03712
[317] Chen, X.; Deng, X.; Teng, SH, Settling the complexity of computing two-player Nash equilibria, J. ACM, 56, 3, 14 (2009) · Zbl 1325.68095 · doi:10.1145/1516512.1516516
[318] Greenwald, A., Hall, K., Serrano, R.: Correlated Q-learning. In: International Conference on Machine Learning, pp. 242-249 (2003)
[319] Aumann, RJ, Subjectivity and correlation in randomized strategies, J. Math. Econ., 1, 1, 67-96 (1974) · Zbl 0297.90106 · doi:10.1016/0304-4068(74)90037-8
[320] Perolat, J., Strub, F., Piot, B., Pietquin, O.: Learning Nash equilibrium for general-sum Markov games from batch data. In: International Conference on Artificial Intelligence and Statistics, pp. 232-241 (2017)
[321] Maillard, O.A., Munos, R., Lazaric, A., Ghavamzadeh, M.: Finite-sample analysis of Bellman residual minimization. In: Asian Conference on Machine Learning, pp. 299-314 (2010)
[322] Letcher, A.; Balduzzi, D.; Racanière, S.; Martens, J.; Foerster, JN; Tuyls, K.; Graepel, T., Differentiable game mechanics, J. Mach. Learn. Res., 20, 84, 1-40 (2019) · Zbl 1489.91032
[323] Chasnov, B., Ratliff, L.J., Mazumdar, E., Burden, S.A.: Convergence analysis of gradient-based learning with non-uniform learning rates in non-cooperative multi-agent settings (2019). arXiv preprint arXiv:1906.00731
[324] Hart, S.; Mas-Colell, A., Uncoupled dynamics do not lead to Nash equilibrium, Am. Econ. Rev., 93, 5, 1830-1836 (2003) · doi:10.1257/000282803322655581
[325] Saldi, N.; Başar, T.; Raginsky, M., Markov-Nash equilibria in mean-field games with discounted cost, SIAM J. Control Optim., 56, 6, 4256-4287 (2018) · Zbl 1418.91069 · doi:10.1137/17M1112583
[326] Saldi, N., Başar, T., Raginsky, M.: Approximate Nash equilibria in partially observed stochastic games with mean-field interactions. Math. Oper. Res. (2019) · Zbl 1437.91060
[327] Saldi, N.: Discrete-time average-cost mean-field games on Polish spaces (2019). arXiv preprint arXiv:1908.08793 · Zbl 1448.91027
[328] Saldi, N., Başar, T., Raginsky, M.: Discrete-time risk-sensitive mean-field games (2018). arXiv preprint arXiv:1808.03929 · Zbl 1455.91040
[329] Guo, X., Hu, A., Xu, R., Zhang, J.: Learning mean-field games (2019). arXiv preprint arXiv:1901.09585
[330] Fu, Z., Yang, Z., Chen, Y., Wang, Z.: Actor-critic provably finds Nash equilibria of linear-quadratic mean-field games (2019). arXiv preprint arXiv:1910.07498
[331] Hadikhanloo, S., Silva, F.J.: Finite mean field games: fictitious play and convergence to a first order continuous mean field game. J. Math. Pures Appl. (2019) · Zbl 1427.35288
[332] Elie, R., Pérolat, J., Laurière, M., Geist, M., Pietquin, O.: Approximate fictitious play for mean field games (2019). arXiv preprint arXiv:1907.02633
[333] Anahtarci, B., Kariksiz, C.D., Saldi, N.: Value iteration algorithm for mean-field games (2019). arXiv preprint arXiv:1909.01758 · Zbl 1451.91014
[334] Zaman, M.A.u., Zhang, K., Miehling, E., Başar, T.: Approximate equilibrium computation for discrete-time linear-quadratic mean-field games. Submitted to IEEE American Control Conference (2020)
[335] Yang, B., Liu, M.: Keeping in touch with collaborative UAVs: a deep reinforcement learning approach. In: International Joint Conference on Artificial Intelligence, pp. 562-568 (2018)
[336] Pham, H.X., La, H.M., Feil-Seifer, D., Nefian, A.: Cooperative and distributed reinforcement learning of drones for field coverage (2018). arXiv preprint arXiv:1803.07250
[337] Tožička, J., Szulyovszky, B., de Chambrier, G., Sarwal, V., Wani, U., Gribulis, M.: Application of deep reinforcement learning to UAV fleet control. In: SAI Intelligent Systems Conference, pp. 1169-1177 (2018)
[338] Shamsoshoara, A., Khaledi, M., Afghah, F., Razi, A., Ashdown, J.: Distributed cooperative spectrum sharing in UAV networks using multi-agent reinforcement learning. In: IEEE Annual Consumer Communications & Networking Conference, pp. 1-6 (2019)
[339] Cui, J., Liu, Y., Nallanathan, A.: The application of multi-agent reinforcement learning in UAV networks. In: IEEE International Conference on Communications Workshops, pp. 1-6 (2019)
[340] Qie, H., Shi, D., Shen, T., Xu, X., Li, Y., Wang, L.: Joint optimization of multi-UAV target assignment and path planning based on multi-agent reinforcement learning. IEEE Access (2019)
[341] Hochreiter, S.; Schmidhuber, J., Long short-term memory, Neural Comput., 9, 8, 1735-1780 (1997) · doi:10.1162/neco.1997.9.8.1735
[342] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998-6008 (2017)
[343] Hausknecht, M., Stone, P.: Deep recurrent Q-learning for partially observable MDPs. In: 2015 AAAI Fall Symposium Series (2015)
[344] Jorge, E., Kågebäck, M., Johansson, F.D., Gustavsson, E.: Learning to play Guess Who? and inventing a grounded language as a consequence (2016). arXiv preprint arXiv:1611.03218
[345] Sukhbaatar, S., Fergus, R., et al.: Learning multiagent communication with backpropagation. In: Advances in Neural Information Processing Systems, pp. 2244-2252 (2016)
[346] Havrylov, S., Titov, I.: Emergence of language with multi-agent games: learning to communicate with sequences of symbols. In: Advances in Neural Information Processing Systems, pp. 2149-2159 (2017)
[347] Das, A., Kottur, S., Moura, J.M., Lee, S., Batra, D.: Learning cooperative visual dialog agents with deep reinforcement learning. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2951-2960 (2017)
[348] Peng, P., Wen, Y., Yang, Y., Yuan, Q., Tang, Z., Long, H., Wang, J.: Multiagent bidirectionally-coordinated nets: emergence of human-level coordination in learning to play StarCraft combat games (2017). arXiv preprint arXiv:1703.10069
[349] Mordatch, I., Abbeel, P.: Emergence of grounded compositional language in multi-agent populations. In: AAAI Conference on Artificial Intelligence (2018)
[350] Jiang, J., Lu, Z.: Learning attentional communication for multi-agent cooperation. In: Advances in Neural Information Processing Systems, pp. 7254-7264 (2018)
[351] Jiang, J., Dun, C., Lu, Z.: Graph convolutional reinforcement learning for multi-agent cooperation (2018). arXiv preprint arXiv:1810.09202
[352] Celikyilmaz, A., Bosselut, A., He, X., Choi, Y.: Deep communicating agents for abstractive summarization (2018). arXiv preprint arXiv:1803.10357
[353] Das, A., Gervet, T., Romoff, J., Batra, D., Parikh, D., Rabbat, M., Pineau, J.: TarMAC: targeted multi-agent communication (2018). arXiv preprint arXiv:1810.11187
[354] Lazaridou, A., Hermann, K.M., Tuyls, K., Clark, S.: Emergence of linguistic communication from referential games with symbolic and pixel input (2018). arXiv preprint arXiv:1804.03984
[355] Cogswell, M., Lu, J., Lee, S., Parikh, D., Batra, D.: Emergence of compositional language with deep generational transmission (2019). arXiv preprint arXiv:1904.09067
[356] Allis, L.: Searching for solutions in games and artificial intelligence. Ph.D. thesis, Maastricht University (1994)
[357] Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097-1105 (2012)
[358] Silver, D.; Hubert, T.; Schrittwieser, J.; Antonoglou, I.; Lai, M.; Guez, A.; Lanctot, M.; Sifre, L.; Kumaran, D.; Graepel, T.; Lillicrap, T.; Simonyan, K.; Hassabis, D., A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play, Science, 362, 6419, 1140-1144 (2018) · Zbl 1433.68320 · doi:10.1126/science.aar6404
[359] Billings, D., Davidson, A., Schaeffer, J., Szafron, D.: The challenge of Poker. Artif. Intell. 134(1-2), 201-240 (2002) · Zbl 0982.68125
[360] Kuhn, H.W.: A simplified two-person Poker. Contrib. Theory Games 1, 97-103 (1950) · Zbl 0041.25601
[361] Southey, F., Bowling, M., Larson, B., Piccione, C., Burch, N., Billings, D., Rayner, C.: Bayes’ bluff: opponent modelling in Poker. In: Proceedings of the 21st Conference on Uncertainty in Artificial Intelligence, pp. 550-558. AUAI Press (2005)
[362] Bowling, M., Burch, N., Johanson, M., Tammelin, O.: Heads-up limit hold’em Poker is solved. Science 347(6218), 145-149 (2015)
[363] Heinrich, J., Silver, D.: Smooth UCT search in computer Poker. In: 24th International Joint Conference on Artificial Intelligence (2015)
[364] Moravčík, M., Schmid, M., Burch, N., Lisý, V., Morrill, D., Bard, N., Davis, T., Waugh, K., Johanson, M., Bowling, M.: DeepStack: expert-level artificial intelligence in heads-up no-limit Poker. Science 356(6337), 508-513 (2017) · Zbl 1403.68202
[365] Brown, N., Sandholm, T.: Superhuman AI for heads-up no-limit Poker: Libratus beats top professionals. Science 359(6374), 418-424 (2018) · Zbl 1415.68163
[366] Burch, N., Johanson, M., Bowling, M.: Solving imperfect information games using decomposition. In: 28th AAAI Conference on Artificial Intelligence (2014)
[367] Moravcik, M., Schmid, M., Ha, K., Hladik, M., Gaukrodger, S.J.: Refining subgames in large imperfect information games. In: 30th AAAI Conference on Artificial Intelligence (2016)
[368] Brown, N., Sandholm, T.: Safe and nested subgame solving for imperfect-information games. In: Advances in Neural Information Processing Systems, pp. 689-699 (2017)
[369] Vinyals, O., Ewalds, T., Bartunov, S., Georgiev, P., Vezhnevets, A.S., Yeo, M., Makhzani, A., Küttler, H., Agapiou, J., Schrittwieser, J., et al.: StarCraft II: a new challenge for reinforcement learning (2017). arXiv preprint arXiv:1708.04782
[370] Vinyals, O., Babuschkin, I., Czarnecki, W.M., Mathieu, M., Dudzik, A., Chung, J., Choi, D.H., Powell, R., Ewalds, T., Georgiev, P., et al.: Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575, 350-354 (2019)
[371] Mnih, V., Badia, A.P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., Kavukcuoglu, K.: Asynchronous methods for deep reinforcement learning. In: International Conference on Machine Learning, pp. 1928-1937 (2016)
[372] Lerer, A., Peysakhovich, A.: Maintaining cooperation in complex social dilemmas using deep reinforcement learning (2017). arXiv preprint arXiv:1707.01068
[373] Hughes, E., Leibo, J.Z., Phillips, M., Tuyls, K., Dueñez-Guzman, E., Castañeda, A.G., Dunning, I., Zhu, T., McKee, K., Koster, R., et al.: Inequity aversion improves cooperation in intertemporal social dilemmas. In: Advances in Neural Information Processing Systems, pp. 3326-3336 (2018)
[374] Cai, Q., Yang, Z., Lee, J.D., Wang, Z.: Neural temporal-difference learning converges to global optima (2019). arXiv preprint arXiv:1905.10027
[375] Arora, S., Cohen, N., Hazan, E.: On the optimization of deep networks: implicit acceleration by overparameterization (2018). arXiv preprint arXiv:1802.06509
[376] Li, Y., Liang, Y.: Learning overparameterized neural networks via stochastic gradient descent on structured data. In: Advances in Neural Information Processing Systems, pp. 8157-8166 (2018)
[377] Brafman, RI; Tennenholtz, M., A near-optimal polynomial time algorithm for learning in certain classes of stochastic games, Artif. Intell., 121, 1-2, 31-47 (2000) · Zbl 0951.68119 · doi:10.1016/S0004-3702(00)00039-4
[378] Brafman, R.I., Tennenholtz, M.: R-max - a general polynomial time algorithm for near-optimal reinforcement learning. J. Mach. Learn. Res. 3, 213-231 (2002) · Zbl 1088.68694
[379] Tu, S., Recht, B.: The gap between model-based and model-free methods on the linear quadratic regulator: an asymptotic viewpoint (2018). arXiv preprint arXiv:1812.03565
[380] Sun, W., Jiang, N., Krishnamurthy, A., Agarwal, A., Langford, J.: Model-based RL in contextual decision processes: PAC bounds and exponential improvements over model-free approaches. In: Conference on Learning Theory, pp. 2898-2933 (2019)
[381] Lin, Q., Liu, M., Rafique, H., Yang, T.: Solving weakly-convex-weakly-concave saddle-point problems as weakly-monotone variational inequality (2018). arXiv preprint arXiv:1810.10207
[382] García, J.; Fernández, F., A comprehensive survey on safe reinforcement learning, J. Mach. Learn. Res., 16, 1, 1437-1480 (2015) · Zbl 1351.68209
[383] Chen, Y., Su, L., Xu, J.: Distributed statistical machine learning in adversarial settings: Byzantine gradient descent. Proc. ACM Meas. Anal. Comput. Syst. 1(2), 44 (2017)
[384] Yin, D., Chen, Y., Ramchandran, K., Bartlett, P.: Byzantine-robust distributed learning: towards optimal statistical rates (2018). arXiv preprint arXiv:1803.01498
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases, these data have been complemented or enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible, without claiming completeness or perfect matching.