
HSVI Can Solve Zero-Sum Partially Observable Stochastic Games

Published in Dynamic Games and Applications.

Abstract

State-of-the-art methods for solving 2-player zero-sum imperfect information games rely on linear programming or regret minimization, but not on dynamic programming (DP) or heuristic search (HS), although the latter are often at the core of state-of-the-art solvers for other sequential decision-making problems. In partially observable or collaborative settings (e.g., POMDPs and Dec-POMDPs), DP and HS require introducing an appropriate statistic that induces a fully observable problem, as well as bounding (convex) approximators of the optimal value function. This approach has also succeeded in some subclasses of 2-player zero-sum partially observable stochastic games (zs-POSGs), but how to apply it in the general case remains an open question. We answer it by (i) rigorously defining an equivalent game to work with, (ii) proving mathematical properties of the optimal value function that allow deriving bounds that come with solution strategies, (iii) proposing, for the first time, an HSVI-like solver that provably converges to an \(\epsilon \)-optimal solution in finite time, and (iv) empirically analyzing it. This opens the door to a novel family of promising approaches complementing those relying on linear programming or iterative methods.


Notes

  1. Note: POSGs are equivalent to the large class of “well-behaved” EFGs as defined by [28].

  2. As discussed in the following, due to global-consistency issues, this game is not equivalent to the original zs-POSG.

  3. We use (i) “Markov game” instead of “stochastic game” because the dynamics are not stochastic, and (ii) “partially observable stochastic game” to stick with the literature.

  4. Note that the same distinction between two possible transformations exists for Dec-POMDPs that are translated either in Occupancy MDPs or in Non-Observable MDPs [14].

  5. In the Dec-POMDP setting, assuming that decision rules are public is valid because agents collaborate.

  6. The proof process is similar. The only difference lies in the space at hand, but without any impact on the resulting formulas.

References

  1. Åström KJ (1965) Optimal control of Markov processes with incomplete state information. J Math Anal Appl 10(1):174–205

  2. Basilico N, De Nittis G, Gatti N (2016) A security game combining patrolling and alarm-triggered responses under spatial and detection uncertainties. In: Proceedings of the thirtieth AAAI conference on artificial intelligence

  3. Basu A, Stettner L (2015) Finite- and infinite-horizon Shapley games with nonsymmetric partial observation. SIAM J Control Optim 53(6):3584–3619

  4. Bernstein DS, Givan R, Immerman N, Zilberstein S (2002) The complexity of decentralized control of Markov decision processes. Math Oper Res 27(4):819–840

  5. Bono G, Dibangoye J, Matignon L, Pereyron F, Simonin O (2018) Cooperative multi-agent policy gradient. In: Proceedings of the twenty-eighth European conference on machine learning

  6. Bošanský B, Kiekintveld C, Lisý V, Pěchouček M (2014) An exact double-oracle algorithm for zero-sum extensive-form games with imperfect information. J Artif Intell Res 51:829–866. https://doi.org/10.1613/jair.4477

  7. Brown N, Sandholm T (2018) Superhuman AI for heads-up no-limit poker: libratus beats top professionals. Science 359(6374):418–424

  8. Brown N, Kroer C, Sandholm T (2017) Dynamic thresholding and pruning for regret minimization. In: Proceedings of the AAAI conference on artificial intelligence, vol 31

  9. Brown N, Bakhtin A, Lerer A, Gong Q (2020) Combining deep reinforcement learning and search for imperfect-information games. In: Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin H (eds) Advances in neural information processing systems, vol 33, pp 17057–17069

  10. Burch N, Johanson M, Bowling M (2014) Solving imperfect information games using decomposition. In: Twenty-eighth AAAI conference on artificial intelligence

  11. Burch N, Moravcik M, Schmid M (2019) Revisiting CFR+ and alternating updates. J Artif Intell Res 64:429–443

  12. Chatterjee K, Doyen L (2014) Partial-observation stochastic games: how to win when belief fails. ACM Trans Comput Log 15(2):16

  13. Cole HL, Kocherlakota N (2001) Dynamic games with hidden actions and hidden states. J Econ Theory 98(1):114–126

  14. Dibangoye J, Amato C, Buffet O, Charpillet F (2016) Optimally solving Dec-POMDPs as continuous-state MDPs. J Artif Intell Res 55:443–497

  15. Ghosh MK, McDonald DR, Sinha S (2004) Zero-sum stochastic games with partial information. J Optim Theory Appl 121(1):99–118

  16. Hansen EA, Zhou R (2007) Anytime heuristic search. J Artif Intell Res 28:267–297. https://doi.org/10.1613/jair.2096

  17. Hansen EA, Bernstein DS, Zilberstein S (2004) Dynamic programming for partially observable stochastic games. In: AAAI, vol 4, pp 709–715

  18. Hoda S, Gilpin A, Peña J, Sandholm T (2010) Smoothing techniques for computing Nash equilibria of sequential games. Math Oper Res 35(2):494–512

  19. Horák K, Bošanskỳ B, Kovařík V, Kiekintveld C (2023) Solving zero-sum one-sided partially observable stochastic games. Artif Intell 316:103838

  20. Horák K (2019) Scalable algorithms for solving stochastic games with limited partial observability. PhD thesis, Czech Technical University in Prague, Faculty of Electrical Engineering

  21. Horák K, Bošanský B (2019) Solving partially observable stochastic games with public observations. In: Proceedings of the thirty-third AAAI conference on artificial intelligence, pp 2029–2036

  22. Horák K, Bošanský B, Pěchouček M (2017) Heuristic search value iteration for one-sided partially observable stochastic games. In: Proceedings of the thirty-first AAAI conference on artificial intelligence, pp 558–564

  23. Koller D, Megiddo N, von Stengel B (1994) Fast algorithms for finding randomized strategies in game trees. In: Proceedings of the 26th ACM symposium on the theory of computing (STOC’94), pp 750–759

  24. Koller D, Megiddo N, von Stengel B (1996) Efficient computation of equilibria for extensive two-person games. Games Econom Behav 14(51):220–246

  25. Korf RE (1990) Real-time heuristic search. Artif Intell 42(2):189–211. https://doi.org/10.1016/0004-3702(90)90054-4

  26. Kovařík V, Seitz D, Lisý V, Rudolf J, Sun S, Ha K (2019) Value functions for depth-limited solving in imperfect-information games. CoRR arXiv:1906.06412

  27. Kovařík V, Seitz D, Lisý V, Rudolf J, Sun S, Ha K (2022) Value functions for depth-limited solving in zero-sum imperfect-information games. Artif Intell. https://doi.org/10.1016/j.artint.2022.103805

  28. Kovařík V, Schmid M, Burch N, Bowling M, Lisý V (2019) Rethinking formal models of partially observable multiagent decision making. CoRR arXiv:1906.11110

  29. Kroer C, Waugh K, Kılınç-Karzan F, Sandholm T (2020) Faster algorithms for extensive-form game solving via improved smoothing functions. Math Program 179:385–417. https://doi.org/10.1007/s10107-018-1336-7

  30. Kuhn HW (1950) Simplified two-person Poker. In: Kuhn HW, Tucker AW (eds) Contributions to the theory of games, vol 1

  31. Lanctot M, Waugh K, Zinkevich M, Bowling M (2009) Monte Carlo sampling for regret minimization in extensive games. Adv Neural Inf Process Syst

  32. Lanctot M, Lockhart E, Lespiau J-B, Zambaldi V, Upadhyay S, Pérolat J, Srinivasan S, Timbers F, Tuyls K, Omidshafiei S, Hennes D, Morrill D, Muller P, Ewalds T, Faulkner R, Kramár J, Vylder BD, Saeta B, Bradbury J, Ding D, Borgeaud S, Lai M, Schrittwieser J, Anthony T, Hughes E, Danihelka I, Ryan-Davis J (2019) OpenSpiel: a framework for reinforcement learning in games. CoRR arXiv:1908.09453 [cs.LG]

  33. Moravčík M, Schmid M, Burch N, Lisý V, Morrill D, Bard N, Davis T, Waugh K, Johanson M, Bowling M (2017) DeepStack: expert-level artificial intelligence in heads-up no-limit Poker. Science 356(6337):508–513

  34. Oliehoek F, Vlassis N (2006) Dec-POMDPs and extensive form games: equivalence of models and algorithms. Technical Report IAS-UVA-06-02, Intelligent Systems Laboratory Amsterdam, University of Amsterdam

  35. Oliehoek FA (2013) Sufficient plan-time statistics for decentralized POMDPs. In: Twenty-third international joint conference on artificial intelligence

  36. Schmid M (2021) Search in imperfect information games. PhD thesis, Charles University - Univerzita Karlova, Prague. arXiv:2111.05884

  37. Schmid M, Moravcik M, Burch N, Kadlec R, Davidson J, Waugh K, Bard N, Timbers F, Lanctot M, Holland Z, Davoodi E, Christianson A, Bowling M (2021) Player of games. CoRR arXiv:2112.03178

  38. Smith T (2007) Probabilistic planning for robotic exploration. PhD thesis, The Robotics Institute, Carnegie Mellon University

  39. Smith T, Simmons RG (2005) Point-based POMDP algorithms: improved analysis and implementation. In: Proceedings of the twenty-first conference on uncertainty in artificial intelligence, pp 542–549

  40. Sokota S, D’Orazio R, Ling CK, Wu DJ, Kolter JZ, Brown N (2023) Abstracting imperfect information away from two-player zero-sum games. arXiv preprint arXiv:2301.09159

  41. Szer D, Charpillet F, Zilberstein S (2005) MAA*: a heuristic search algorithm for solving decentralized POMDPs. In: Proceedings of the twenty-first conference on uncertainty in artificial intelligence, pp 576–583

  42. Tammelin O (2014) Solving large imperfect information games using CFR+. CoRR arXiv:1407.5042

  43. von Neumann J (1928) Zur Theorie der Gesellschaftsspiele. Mathematische Annalen 100:295–320

  44. von Stengel B (1996) Efficient computation of behavior strategies. Games Econom Behav 14(50):220–246

  45. Wiggers A (2015) Structure in the value function of two-player zero-sum games of incomplete information. Master’s thesis, University of Amsterdam

  46. Wiggers A, Oliehoek F, Roijers D (2016) Structure in the value function of two-player zero-sum games of incomplete information. In: Proceedings of the twenty-second European conference on artificial intelligence, pp 1628–1629. https://doi.org/10.3233/978-1-61499-672-9-1628

  47. Wiggers A, Oliehoek F, Roijers D (2016) Structure in the value function of two-player zero-sum games of incomplete information. CoRR arXiv:1606.06888

  48. Xie Y, Dibangoye J, Buffet O (2020) Optimally solving two-agent decentralized POMDPs under one-sided information sharing. In: International conference on machine learning, pp 10473–10482. PMLR

  49. Zinkevich M, Johanson M, Bowling M, Piccione C (2007) Regret minimization in games with incomplete information. In: Advances in neural information processing systems, vol 20

Funding

This research was funded, in whole or in part, by ANR Project PLASMA (ANR-19-CE23-0018). A CC-BY public copyright license has been applied by the authors to the present document and will be applied to all subsequent versions up to the Author Accepted Manuscript arising from this submission, in accordance with the Grant’s open access conditions. https://creativecommons.org/licenses/by/4.0/.

Author information

Corresponding author

Correspondence to Aurélien Delage.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This work has been conducted while Jilles Dibangoye was at INSA Lyon.

Appendices

Appendix A Synthetic Tables

For convenience, we provide three synthetic tables: Table 3 sums up various theoretical properties stated in this paper (assuming a finite temporal horizon), while Tables 4 and 5, respectively, sum up the notations and abbreviations used in this paper, including some notations that appear only in the appendix.

More precisely, Table 3 indicates, for various functions f and variables x, properties that f is known to exhibit with respect to x. We denote by

-:

a function with no known (or used) property (see also comment below);

n/a:

a non-applicable case;

Lin:

a linear function;

LC:

a Lipschitz-continuous function;

Cv:

(resp. Cc) a convex (resp. concave) function;

\(\textit{PWLCv}\):

(resp. PWLCc) a piecewise linear and convex (resp. concave) function;

\(\perp \!\!\! \perp \):

the function being independent of the variable;

\(\lnot P\):

the negation of some property P (i.e., P is known not to hold).

Note also that, as \(\sigma _\tau = \sigma _\tau ^{c,1} \sigma _\tau ^{m,1}\), the linearity or Lipschitz-continuity of any function w.r.t. \(\sigma _\tau \) extends to both \(\sigma _\tau ^{c,1}\) and \(\sigma _\tau ^{m,1}\). Conversely, the related negative results extend from \(\sigma _\tau ^{c,1}\) or \(\sigma _\tau ^{m,1}\) to \(\sigma _\tau \). In these three columns, we only indicate results that cannot be derived from one of the other two columns.

Table 3 Known properties of various functions appearing in this work
Table 4 Various notations used in this work
Table 5 Various abbreviations used in this work

Appendix B Background

1.1 B.1 Occupancy States

The following result shows that the occupancy state is (i) Markovian, i.e., its value at \(\tau \) only depends on its previous value \(\sigma _{\tau -1}\), the system dynamics \(P_{a^1,a^2}^{z^1,z^2}(s'|s)\), and the last behavioral decision rules \(\beta ^1_{\tau -1}\) and \(\beta ^2_{\tau -1}\), and (ii) sufficient to estimate the expected reward. Note that this result holds for general-sum POSGs with any number of agents and reward functions; similar results have already been established, e.g., for Dec-POMDPs [14, Theorem 1].

Proposition 12

(originally stated on page 7) \(\sigma _{{\varvec{\beta }}_{0:\tau -1}}\), together with \({\varvec{\beta }}_\tau \), is a sufficient statistic to compute (i) the next os, \(T(\sigma _{{\varvec{\beta }}_{0:\tau -1}},{\varvec{\beta }}_\tau ) {\mathop {=}\limits ^{\tiny {\text {def}}}}\sigma _{{\varvec{\beta }}_{0:\tau }}\), and (ii) the expected reward at \(\tau \): \(r(\sigma _{{\varvec{\beta }}_{0:\tau -1}},{\varvec{\beta }}_\tau ) {\mathop {=}\limits ^{\tiny {\text {def}}}}{{\,\mathrm{\mathbb {E}}\,}}\left[ R_\tau \mid {\varvec{\beta }}_{0:\tau -1} \oplus {\varvec{\beta }}_\tau \right] \), where \(\oplus \) denotes a concatenation.

Proof

Let us first derive a recursive way of computing \( \sigma _{ {\varvec{\beta }}_{0:\tau } }( {\varvec{\theta }}_\tau , {\varvec{a}}_\tau , {\varvec{z}}_{\tau +1}) \):

$$\begin{aligned}&\sigma _{ {\varvec{\beta }}_{0:\tau } }( {\varvec{\theta }}_\tau , {\varvec{a}}_\tau , {\varvec{z}}_{\tau +1}) {\mathop {=}\limits ^{\tiny {\text {def}}}}Pr( {\varvec{\theta }}_\tau , {\varvec{a}}_\tau , {\varvec{z}}_{\tau +1} \mid {\varvec{\beta }}_{0:\tau } ) \\&\quad = \sum _{s_\tau , s_{\tau +1}} Pr( {\varvec{\theta }}_\tau , {\varvec{a}}_\tau , {\varvec{z}}_{\tau +1}, s_\tau , s_{\tau +1} \mid {\varvec{\beta }}_{0:\tau } )\\&\quad = \sum _{s_\tau , s_{\tau +1}} Pr( {\varvec{z}}_{\tau +1}, s_{\tau +1} \mid {\varvec{\theta }}_\tau , {\varvec{a}}_\tau , s_\tau , {\varvec{\beta }}_{0:\tau } ) Pr( {\varvec{a}}_\tau \mid {\varvec{\theta }}_\tau , s_\tau , {\varvec{\beta }}_{0:\tau } )\\&\qquad \times \, Pr( s_\tau \mid {\varvec{\theta }}_\tau , {\varvec{\beta }}_{0:\tau } ) Pr( {\varvec{\theta }}_\tau \mid {\varvec{\beta }}_{0:\tau } )\\&\quad = \sum _{s_\tau , s_{\tau +1}} \underbrace{ Pr( {\varvec{z}}_{\tau +1}, s_{\tau +1} \mid {\varvec{a}}_\tau , s_\tau ) }_{= P_{{\varvec{a}}_\tau }^{{\varvec{z}}_{\tau +1}}(s_{\tau +1}|s_\tau )} \underbrace{ Pr( {\varvec{a}}_\tau \mid {\varvec{\theta }}_\tau , {\varvec{\beta }}_{\tau } ) }_{= {\varvec{\beta }}({\varvec{\theta }}_\tau , {\varvec{a}}_\tau )} \underbrace{ Pr( s_\tau \mid {\varvec{\theta }}_\tau , {\varvec{\beta }}_{0:\tau } ) }_{= b(s_\tau \mid {\varvec{\theta }}_\tau )} \underbrace{ Pr( {\varvec{\theta }}_\tau \mid {\varvec{\beta }}_{0:\tau -1} ) }_{= \sigma _{{\varvec{\beta }}_{0:\tau -1}}({\varvec{\theta }}_\tau )}, \end{aligned}$$

(where \(b(s \mid {\varvec{\theta }}_\tau )\) is the belief over states obtained by a usual HMM filtering process)

$$\begin{aligned}&= \sum _{s_\tau , s_{\tau +1}} P_{{\varvec{a}}_\tau }^{{\varvec{z}}_{\tau +1}}(s_{\tau +1}|s_\tau ) {\varvec{\beta }}({\varvec{\theta }}_\tau , {\varvec{a}}_\tau ) b(s_\tau \mid {\varvec{\theta }}_\tau ) \sigma _{{\varvec{\beta }}_{0:\tau -1}}({\varvec{\theta }}_\tau ). \end{aligned}$$

\(\sigma _{ {\varvec{\beta }}_{0:\tau } }\) can thus be computed from \(\sigma _{{\varvec{\beta }}_{0:\tau -1}}\) and \({\varvec{\beta }}_\tau \) without explicitly using \({\varvec{\beta }}_{0:\tau -1}\) or earlier occupancy states.

Then, let us compute the expected reward at \(\tau \) given \({\varvec{\beta }}_{0:\tau }\):

$$\begin{aligned}&E[r(S_\tau ,A^1_\tau ,A^2_\tau ) \mid {\varvec{\beta }}_{0:\tau } ] \\&\quad = \sum _{s_\tau , {\varvec{a}}_\tau } r(s_\tau , {\varvec{a}}_\tau ) Pr( s_\tau , {\varvec{a}}_\tau \mid {\varvec{\beta }}_{0:\tau } )\\&\quad = \sum _{s_\tau , {\varvec{a}}_\tau } \sum _{{\varvec{\theta }}_\tau } r(s_\tau , {\varvec{a}}_\tau ) Pr( s_\tau , {\varvec{a}}_\tau , {\varvec{\theta }}_\tau \mid {\varvec{\beta }}_{0:\tau } )\\&\quad = \sum _{s_\tau , {\varvec{a}}_\tau } \sum _{{\varvec{\theta }}_\tau } r(s_\tau , {\varvec{a}}_\tau ) Pr( s_\tau , {\varvec{a}}_\tau \mid {\varvec{\theta }}_\tau , {\varvec{\beta }}_{0:\tau } ) Pr( {\varvec{\theta }}_\tau \mid {\varvec{\beta }}_{0:\tau } )\\&\quad = \sum _{s_\tau , {\varvec{a}}_\tau } \sum _{{\varvec{\theta }}_\tau } r(s_\tau , {\varvec{a}}_\tau ) \underbrace{ Pr( {\varvec{a}}_\tau \mid {\varvec{\theta }}_\tau , {\varvec{\beta }}_{0:\tau } ) }_{ {\varvec{\beta }}_\tau ({\varvec{\theta }}_\tau , {\varvec{a}}_\tau ) } \underbrace{ Pr( s_\tau \mid {\varvec{\theta }}_\tau , {\varvec{\beta }}_{0:\tau } ) }_{ b(s_\tau \mid {\varvec{\theta }}_\tau ) } \underbrace{ Pr( {\varvec{\theta }}_\tau \mid {\varvec{\beta }}_{0:\tau } ) }_{\sigma _{{\varvec{\beta }}_{0:\tau -1}}( {\varvec{\theta }}_\tau ) }\\&\quad = \sum _{s_\tau , {\varvec{a}}_\tau } \sum _{{\varvec{\theta }}_\tau } r(s_\tau , {\varvec{a}}_\tau ) {\varvec{\beta }}_\tau ({\varvec{\theta }}_\tau , {\varvec{a}}_\tau ) b(s_\tau \mid {\varvec{\theta }}_\tau ) \sigma _{{\varvec{\beta }}_{0:\tau -1}}( {\varvec{\theta }}_\tau ). \end{aligned}$$

The expected reward at \(\tau \) can thus be computed from \(\sigma _{{\varvec{\beta }}_{0:\tau -1}}\) and \({\varvec{\beta }}_\tau \) without explicitly using \({\varvec{\beta }}_{0:\tau -1}\) or earlier occupancy states. \(\square \)
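As an illustration of this proof, the occupancy-state update \(T\) and the expected reward \(r\) can be implemented directly from the two derivations above. The following is a minimal Python sketch under simplifying assumptions (finite sets, distributions stored as dictionaries over pairs \(({\varvec{\theta }}_\tau , s_\tau )\), so that \(b(s_\tau \mid {\varvec{\theta }}_\tau )\,\sigma ({\varvec{\theta }}_\tau )\) is carried jointly); all names are illustrative and not taken from the paper's implementation.

```python
from collections import defaultdict

def update_occupancy(sigma, beta, P):
    """One-step occupancy update T(sigma, beta).

    sigma: dict {(history, state): prob}, where history is a tuple of joint
           (action, observation) pairs -- sigma carries b(s|theta)*sigma(theta)
    beta:  dict {history: {action: prob}} -- joint behavioral decision rule
    P:     dict {(state, action): {(next_state, observation): prob}} -- dynamics
    """
    nxt = defaultdict(float)
    for (theta, s), p in sigma.items():
        for a, pa in beta[theta].items():
            for (s2, z), pt in P[(s, a)].items():
                nxt[(theta + ((a, z),), s2)] += p * pa * pt
    return dict(nxt)

def expected_reward(sigma, beta, R):
    """Expected reward r(sigma, beta), with R: dict {(state, action): reward}."""
    return sum(p * pa * R[(s, a)]
               for (theta, s), p in sigma.items()
               for a, pa in beta[theta].items())

# Tiny illustrative example: two states, one action, one observation.
sigma0 = {((), 's0'): 1.0}
beta0 = {(): {'a': 1.0}}
P = {('s0', 'a'): {('s0', 'z'): 0.5, ('s1', 'z'): 0.5},
     ('s1', 'a'): {('s1', 'z'): 1.0}}
R = {('s0', 'a'): 1.0, ('s1', 'a'): 0.0}

r0 = expected_reward(sigma0, beta0, R)       # 1.0
sigma1 = update_occupancy(sigma0, beta0, P)  # a distribution: still sums to 1
```

As in the proof, the update only needs \(\sigma _{{\varvec{\beta }}_{0:\tau -1}}\) and \({\varvec{\beta }}_\tau \), never the earlier decision rules themselves.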

Appendix C Occupancy Markov Games: Definition and Preliminary Properties

1.1 C.1 Properties of \(V^*\)

Before proving the postulates implicitly used by [47], we need to show that we can reason with mixed strategies in subgames as is usually done in full games.

1.1.1 C.1.1 \(V^*\) is Not Linear in \(\beta \) (Behavioral Strategies)

Let us consider the following (finite-horizon, deterministic) Non-Observable MDP:

$$\begin{aligned} {{\mathcal {S}}}&{\mathop {=}\limits ^{\tiny {\text {def}}}}\{ -2, -1, 0, +1, +2\}, \quad b_0(0) =1,&(\text {always start in}\, s=0) \\ {{\mathcal {A}}}&{\mathop {=}\limits ^{\tiny {\text {def}}}}\{-1, +1\},&\text {(moves = add or subtract 1)} \\ T(s,a)&{\mathop {=}\limits ^{\tiny {\text {def}}}}\min \{ +2, \max \{ -2, s+a \} \},&(+1\,\,\text {or}\,\, -1\,\, \text {move in}\,\, {{\mathcal {S}}}) \\ {{\mathcal {Z}}}&{\mathop {=}\limits ^{\tiny {\text {def}}}}\{ none \}, \quad O(none) {\mathop {=}\limits ^{\tiny {\text {def}}}}1,&\text {(no observation)} \\ r(s)&{\mathop {=}\limits ^{\tiny {\text {def}}}}{\left\{ \begin{array}{ll} +1 &{} \text {if }s \in \{-2,+2\} \\ 0 &{} \text {otherwise,} \end{array}\right. }&(|s|=2 :\, \text {success!})\\ \gamma&{\mathop {=}\limits ^{\tiny {\text {def}}}}1, \quad H {\mathop {=}\limits ^{\tiny {\text {def}}}}2. \end{aligned}$$

Let us then consider two particular behavioral strategies:

$$\begin{aligned} \forall \theta ,\ \beta ^{+}(A=+1|\theta )&= 1&(\text {always}\, +1), \text {and} \\ \forall \theta ,\ \beta ^{-}(A=-1|\theta )&= 1&(\text {always}\,\, -1). \end{aligned}$$

These two strategies are optimal, with an expected return of \(+1\), because, at \(t=H=2\), \(\beta ^+\) reaches \(+2\) w.p. 1, and \(\beta ^-\) reaches \(-2\) w.p. 1:

$$\begin{aligned} V(\beta ^+)&= V(\beta ^-) = +1. \end{aligned}$$

Let us now consider their linear combination \(\beta ^\pm {\mathop {=}\limits ^{\tiny {\text {def}}}}\frac{1}{2}\beta ^+ + \frac{1}{2} \beta ^-\):

$$\begin{aligned} \forall \theta ,\ \beta ^{\pm }(A=-1|\theta )&= 0.5, \\ \forall \theta ,\ \beta ^{\pm }(A=+1|\theta )&= 0.5. \end{aligned}$$

Here, the probability of reaching \(s=-2\) or \(s=+2\) at the last time step is lower, which gives the value of that strategy:

$$\begin{aligned} V(\beta ^{\pm })&= Pr(s_2=+2 | \beta ^{\pm } ) + Pr(s_2=-2 | \beta ^{\pm } ) \\&= Pr(a_0=+1 | \beta ^{\pm } ) \cdot Pr(a_1=+1 | \beta ^{\pm } ) \\&\quad + Pr(a_0=-1 | \beta ^{\pm } ) \cdot Pr(a_1=-1 | \beta ^{\pm } ) \\&= \underbrace{0.5\cdot 0.5}_{0.25} + \underbrace{0.5\cdot 0.5}_{0.25} = 0.5. \end{aligned}$$

This confirms that, in a POMDP, the value is not linear in the space of behavioral strategies. As a consequence, in a zs-POSG, the value is not (bi)linear in the spaces of behavioral strategies of both players.
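The computation above can be checked numerically. The sketch below enumerates the four two-step action sequences of this NOMDP (variable names are ours, for illustration only):

```python
from itertools import product

# The deterministic NOMDP above: S = {-2,...,+2}, b0(0) = 1, H = 2, gamma = 1.
def step(s, a):
    return max(-2, min(2, s + a))

def value(policy):
    """Expected return of a behavioral strategy.

    policy: dict {action: prob} over A = {-1, +1}; with a single (void)
    observation, the same distribution is applied at both time steps.
    """
    v = 0.0
    for a0, a1 in product([-1, +1], repeat=2):
        s2 = step(step(0, a0), a1)
        reward = 1.0 if abs(s2) == 2 else 0.0
        v += policy[a0] * policy[a1] * reward
    return v

v_plus = value({+1: 1.0, -1: 0.0})   # beta^+ : always +1  -> 1.0
v_minus = value({+1: 0.0, -1: 1.0})  # beta^- : always -1  -> 1.0
v_mix = value({+1: 0.5, -1: 0.5})    # (beta^+ + beta^-)/2 -> 0.5, not 1.0
```

The mixture's value (0.5) differs from the mixture of the values (1.0), exhibiting the non-linearity.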

1.1.2 C.1.2 Back to Mixed Strategies

We now generalize mixed strategies as a mathematical tool to handle subgames of a zs-OMG as normal-form games, and give some preliminary results.

First, for a given \(\sigma _\tau \) and \(\tau \le \tau '\), let \({\varvec{\mu }}_{0:\tau '-1|\sigma _\tau \rangle }\) denote a mixed strategy profile that is defined over \(0:\tau '-1\), and induces \(\sigma _\tau \) at time \(\tau \). We will also write that it is compatible with \(\sigma _\tau \). Then, to complete a given mixed prefix strategy \({\varvec{\mu }}_{0:\tau '-1|\sigma _\tau \rangle }\) (here \(\tau =\tau '\)), the solver should provide each player with a different suffix strategy to execute for each \(\theta ^i_\tau \) it could be facing. We now detail how to build an equivalent set of mixed full strategies for i. Each of the pure prefix strategies \(\pi ^i_{0:\tau -1}\) used in \(\mu ^i_{0:\tau -1|\sigma _\tau \rangle }\) (belonging to a set denoted \(\Pi ^i_{0:\tau -1|\sigma _\tau \rangle }\)) can be extended by appending a different pure suffix strategy \(\pi ^i_{\tau :H-1}\) at each of its leaf nodes, which leads to a large set of pure strategies \(\Pi ^i_{0:H-1}(\pi ^i_{0:\tau -1})\). Then, let \(M^i_{0:H-1|\sigma _\tau \rangle }\) be the set of mixed full strategies \(\mu ^i_{0:H-1|\sigma _\tau \rangle }\) obtained by considering the distributions over \(\bigcup _{\pi ^i_{0:\tau -1}\in \Pi ^i_{0:\tau -1|\sigma _\tau \rangle }} \Pi ^i_{0:H-1}(\pi ^i_{0:\tau -1})\) that verify, \(\forall \pi ^i_{0:\tau -1}\),

$$\begin{aligned} \sum _{\begin{array}{c} \pi ^i_{0:H-1} \in \\ \Pi ^i_{0:H-1}(\pi ^i_{0:\tau -1}) \hspace{-1cm} \end{array}} \mu ^i_{0:H-1|\sigma _\tau \rangle }(\pi ^i_{0:H-1})&= \mu ^i_{0:\tau -1|\sigma _\tau \rangle }(\pi ^i_{0:\tau -1}). \end{aligned}$$
(C1)

This is the set of mixed strategies compatible with \(\sigma _\tau \).

Lemma 2

\(M^i_{0:H-1|\sigma _\tau \rangle }\) is convex and equivalent to the set of behavioral strategies \(\beta ^i_{0:H-1|\sigma _\tau \rangle }\), thus sufficient to search for a Nash equilibrium in \(\sigma _\tau \).

Proof

Let \(\mu ^i_{0:H-1|\sigma _\tau \rangle }\) and \(\nu ^i_{0:H-1|\sigma _\tau \rangle }\) be two mixed strategies in \(M^i_{0:H-1|\sigma _\tau \rangle }\), i.e., which are both full and compatible with occupancy state \(\sigma _\tau \) at time step \(\tau \), and \(\alpha \in [0,1]\).

Then, for any \(\pi ^i_{0:\tau -1}\),

$$\begin{aligned}&\sum _{\pi ^i_{0:H-1} \in \Pi ^i_{0:H-1}(\pi ^i_{0:\tau -1})} \left[ \alpha \cdot \mu ^i_{0:H-1|\sigma _\tau \rangle }(\pi ^i_{0:H-1}) + (1-\alpha ) \cdot \nu ^i_{0:H-1|\sigma _\tau \rangle }(\pi ^i_{0:H-1}) \right] \\&\quad = \alpha \left[ \sum _{\pi ^i_{0:H-1} \in \Pi ^i_{0:H-1}(\pi ^i_{0:\tau -1})} \mu ^i_{0:H-1|\sigma _\tau \rangle }(\pi ^i_{0:H-1}) \right] \\&\qquad + (1-\alpha ) \left[ \sum _{\pi ^i_{0:H-1} \in \Pi ^i_{0:H-1}(\pi ^i_{0:\tau -1})} \nu ^i_{0:H-1|\sigma _\tau \rangle }(\pi ^i_{0:H-1}) \right] \\ \end{aligned}$$

(because both mixed strategies are compatible with \(\sigma _\tau \) (Eq. (C1)):)

$$\begin{aligned}&= \alpha \cdot \mu ^i_{0:\tau -1|\sigma _\tau \rangle }(\pi ^i_{0:\tau -1}) + (1-\alpha ) \cdot \mu ^i_{0:\tau -1|\sigma _\tau \rangle }(\pi ^i_{0:\tau -1}) \\&= \mu ^i_{0:\tau -1|\sigma _\tau \rangle }(\pi ^i_{0:\tau -1}). \end{aligned}$$

Equation C1 thus also applies to \(\alpha \cdot \mu ^i_{0:H-1|\sigma _\tau \rangle } + (1-\alpha ) \cdot \nu ^i_{0:H-1|\sigma _\tau \rangle }\), proving that it belongs to \(M^i_{0:H-1|\sigma _\tau \rangle }\) and, as a consequence, that this set is convex.

The equivalence with the set of behavioral strategies simply relies on the fact that all mixed strategies over \(\tau :H-1\) can be independently generated at each action-observation history \(\theta ^i_{0:\tau -1}\). \(\square \)
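The compatibility condition (Eq. (C1)) and the convexity argument can be checked mechanically: a convex combination of two mixed full strategies with the same prefix marginals again has those marginals. A small illustrative Python sketch, with hypothetical pure strategies encoded as (prefix, suffix) pairs (none of these names come from the paper):

```python
def prefix_marginal(mu, prefix_of):
    """Marginal of a mixed full strategy on pure prefix strategies (LHS of Eq. C1)."""
    out = {}
    for pi, p in mu.items():
        out[prefix_of(pi)] = out.get(prefix_of(pi), 0.0) + p
    return out

# Hypothetical pure full strategies encoded as (prefix, suffix) pairs.
prefix_of = lambda pi: pi[0]

# Two mixed full strategies with the same prefix marginal (p1: 0.6, p2: 0.4),
# i.e., both compatible with the same mixed prefix strategy:
mu = {('p1', 'sA'): 0.6, ('p2', 'sB'): 0.4}
nu = {('p1', 'sB'): 0.3, ('p1', 'sC'): 0.3, ('p2', 'sA'): 0.4}

alpha = 0.25
mix = {pi: alpha * mu.get(pi, 0.0) + (1 - alpha) * nu.get(pi, 0.0)
       for pi in set(mu) | set(nu)}
# prefix_marginal(mix, prefix_of) is again {p1: 0.6, p2: 0.4}: the set is convex.
```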

While only future rewards are relevant when making a decision at \(\tau \), reasoning with mixed strategies defined from \(t=0\) will be convenient because \(V_\tau (\sigma _\tau ,\cdot ,\cdot )\) is linear in \(\mu ^i_{0:H-1 | \sigma _\tau \rangle }\), which allows us to come back to a standard normal-form game and apply known results.

In the remainder, we simply write \(\mu ^i\) (without indices) for the mixed strategies in \(M^i_{0:H-1|\sigma _\tau \rangle }\), a set which we now denote \(M^i_{|\sigma _\tau \rangle }\). Also, since we shall work with the local game \(Q^*_\tau (\sigma _\tau ,{\varvec{\beta }}_\tau )\), let us define \(M^i_{|\sigma _\tau ,\beta ^j_\tau \rangle }\) as the set of i’s mixed strategies compatible with occupancy states reachable given \(\sigma _\tau \) and \(\beta ^j_\tau \) (with either \(j=i\) or \(j=\lnot i\)). Then, \(M^i_{|T(\sigma _\tau ,{\varvec{\beta }}_\tau )\rangle } \subseteq M^i_{|\sigma _\tau ,\beta ^j_\tau \rangle } \subseteq M^i_{|\sigma _\tau \rangle }\) (the inclusions hold because the latter sets are less constrained in their definition). As a consequence, when maximizing some function f over i’s mixed strategies compatible with a given \(\sigma _\tau \):

$$\begin{aligned} \max _{\mu ^i \in M^i_{| \sigma _\tau \rangle }} f(\sigma _\tau , \mu ^i, \dots )&\ge \max _{\mu ^i \in M^i_{| \sigma _\tau , \beta ^j_\tau \rangle }} f(\sigma _\tau , \mu ^i, \dots ) \ge \max _{\mu ^i \in M^i_{| \sigma _\tau , {\varvec{\beta }}_\tau \rangle }} f(\sigma _\tau , \mu ^i, \dots ). \end{aligned}$$

1.1.3 C.1.3 Von Neumann’s Minimax Theorem for Subgames and a Bellman Optimality Equation

Using the previous results, one can show that von Neumann’s minimax theorem applies in any subgame, allowing us to swap the \(\max \) and \(\min \) operators.

Theorem 3

(originally stated on page 10) The subgame defined in Eq. (3) admits a unique NEV

$$\begin{aligned} V_\tau ^*(\sigma _\tau )&{\mathop {=}\limits ^{\tiny {\text {def}}}}\max _{\beta _{\tau :H-1}^1} \min _{\beta _{\tau :H-1}^2} V_\tau ( \sigma _\tau , \beta _{\tau :H-1}^1,\beta _{\tau :H-1}^2). \end{aligned}$$
(4)

Proof

For any occupancy state \(\sigma _\tau \),

$$\begin{aligned} \max _{\beta _{\tau :}^1} \min _{\beta _{\tau :}^2} V(\sigma _\tau ,{\varvec{\beta }}_{\tau :})&= \max _{\mu _{\tau :}^1} \min _{\mu _{\tau :}^2} V(\sigma _\tau ,{\varvec{\mu }}_{\tau :}) \qquad \text {(Kuhn's theorem (generalized))}\end{aligned}$$
(C2)
$$\begin{aligned}&= \min _{\mu _{\tau :}^2} \max _{\mu _{\tau :}^1} V(\sigma _\tau ,{\varvec{\mu }}_{\tau :}) \qquad \text {(von Neumann's theorem)}\end{aligned}$$
(C3)
$$\begin{aligned}&= \min _{\beta _{\tau :}^2} \max _{\beta _{\tau :}^1} V(\sigma _\tau ,{\varvec{\beta }}_{\tau :}) \qquad \text {(again Kuhn's theorem (generalized))} \end{aligned}$$
(C4)

\(\square \)
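The role of mixing in this theorem can be illustrated on a 2×2 matrix game: over pure strategies, maximin and minimax generally differ, while over mixed strategies they coincide (von Neumann). Below is a minimal numeric sketch using a grid search over mixed strategies; this is only an illustration of the minimax theorem, not the solver used in the paper.

```python
# Matching pennies: row player maximizes; the value is 0, but only with mixing.
A = [[1, -1], [-1, 1]]

def grid_maximin(A, n=101):
    """max over row mixes p (on a grid) of min over the column player's pure responses."""
    best = float('-inf')
    for k in range(n):
        p = k / (n - 1)
        best = max(best, min(p * A[0][j] + (1 - p) * A[1][j] for j in range(2)))
    return best

def grid_minimax(A, n=101):
    """min over column mixes q (on a grid) of max over the row player's pure responses."""
    best = float('inf')
    for k in range(n):
        q = k / (n - 1)
        best = min(best, max(q * A[i][0] + (1 - q) * A[i][1] for i in range(2)))
    return best

pure_maximin = max(min(row) for row in A)                             # -1
pure_minimax = min(max(A[i][j] for i in range(2)) for j in range(2))  # +1
mixed_maximin = grid_maximin(A)  # 0.0: mixing closes the gap
mixed_minimax = grid_minimax(A)  # 0.0: maximin = minimax over mixed strategies
```

The pure-strategy gap (\(-1\) vs \(+1\)) disappears once both players may mix, which is exactly what justifies swapping the operators in Eqs. (C2)–(C4).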

One can also show that a Bellman optimality equation allows relating optimal values in subgames at \(\tau \) and \(\tau +1\), leading to a recursive expression of \(V^*_0\).

Theorem 4

(originally stated on page 10) \(V_\tau ^*(\sigma _\tau )\) satisfies the following functional equation:

$$\begin{aligned} V_\tau ^*(\sigma _\tau )&= \max _{\beta _\tau ^1} \min _{\beta _\tau ^2} r(\sigma _\tau , {\varvec{\beta }}_\tau ) + \gamma V^*_{\tau +1}( T(\sigma _\tau ,{\varvec{\beta }}_\tau ) ) = \max _{\beta _\tau ^1} \min _{\beta _\tau ^2} Q^*_\tau (\sigma _\tau , {\varvec{\beta }}_\tau ). \end{aligned}$$

Proof

Focusing, without loss of generality, on player 1, we have (additional explanations are given below for the numbered lines):

$$\begin{aligned}&\max _{\beta ^1_\tau } \min _{\beta ^2_\tau } Q^*_\tau (\sigma _\tau ,\beta ^1_\tau ,\beta ^2_\tau ) = \max _{\beta ^1_\tau } \min _{\beta ^2_\tau } \left[ r(\sigma _\tau , \beta ^1_\tau , \beta ^2_\tau ) + \gamma V^*_{\tau +1}(T(\sigma _\tau ,\beta ^1_\tau , \beta ^2_\tau )) \right] \end{aligned}$$

(\(V^*_{\tau +1}(T(\sigma _\tau ,\beta ^1_\tau , \beta ^2_\tau ))\) being the Nash equilibrium value of the normal-form game \(V_{\tau +1}(T(\sigma _\tau ,\beta ^1_\tau , \beta ^2_\tau ), \mu ^1, \mu ^2 )\):)

$$\begin{aligned}&= \max _{\beta ^1_\tau } \min _{\beta ^2_\tau } \left[ r(\sigma _\tau , \beta ^1_\tau , \beta ^2_\tau ) + \gamma \max _{\mu ^1\in M^1_{| \sigma _\tau , {\varvec{\beta }}_\tau \rangle }} \min _{\mu ^2\in M^2_{| \sigma _\tau , {\varvec{\beta }}_\tau \rangle }} V_{\tau +1}(T(\sigma _\tau ,\beta ^1_\tau , \beta ^2_\tau ), \mu ^1, \mu ^2 ) \right] \\&= \max _{\beta ^1_\tau } \min _{\beta ^2_\tau } \max _{\mu ^1\in M^1_{| \sigma _\tau , {\varvec{\beta }}_\tau \rangle }} \min _{\mu ^2\in M^2_{| \sigma _\tau , {\varvec{\beta }}_\tau \rangle }} \left[ r(\sigma _\tau , \beta ^1_\tau , \beta ^2_\tau ) + \gamma V_{\tau +1}(T(\sigma _\tau ,\beta ^1_\tau , \beta ^2_\tau ), \mu ^1, \mu ^2 ) \right] \\ \end{aligned}$$

(using the equivalence between maximin and minimax values for the (constrained normal-form) game at \(\tau +1\), the last two max and min operators can be swapped:)

$$\begin{aligned}&= \max _{\beta ^1_\tau } \min _{\beta ^2_\tau } \min _{\mu ^2\in M^2_{| \sigma _\tau , {\varvec{\beta }}_\tau \rangle }} \max _{\mu ^1\in M^1_{| \sigma _\tau , {\varvec{\beta }}_\tau \rangle }} \left[ r(\sigma _\tau , \beta ^1_\tau , \beta ^2_\tau ) + \gamma V_{\tau +1}(T(\sigma _\tau ,\beta ^1_\tau , \beta ^2_\tau ), \mu ^1, \mu ^2) \right] \end{aligned}$$

(merging both mins (and with explanations thereafter):)

$$\begin{aligned}&= \max _{\beta ^1_\tau } \min _{\mu ^2 \in M^2_{| \sigma _\tau , \beta ^1_\tau \rangle }} \max _{\mu ^1 \in M^1_{| \sigma _\tau , \beta ^1_\tau , \beta ^2_\tau (\mu ^2) \rangle }}\nonumber \\&\quad \left[ r(\sigma _\tau , \beta ^1_\tau , \beta ^2_\tau (\mu ^2)) + \gamma V_{\tau +1}(T(\sigma _\tau ,\beta ^1_\tau , \beta ^2_\tau (\mu ^2)), \mu ^1, \mu ^2 ) \right] \end{aligned}$$
(C5)

(since ignoring the opponent’s decision rule does not influence the expected return:)

$$\begin{aligned}&= \max _{\beta ^1_\tau } \min _{\mu ^2\in M^2_{| \sigma _\tau \rangle }} \max _{\mu ^1 \in M^1_{| \sigma _\tau , \beta ^1_\tau \rangle }} \left[ r(\sigma _\tau , \beta ^1_\tau , \beta ^2_\tau (\mu ^2)) + \gamma V_{\tau +1}(T(\sigma _\tau ,\beta ^1_\tau , \beta ^2_\tau (\mu ^2)), \mu ^1, \mu ^2 ) \right] \end{aligned}$$

(using again the minimax theorem’s equivalence between maximin and minimax on an appropriate game:)

$$\begin{aligned}&= \max _{\beta ^1_\tau } \max _{\mu ^1\in M^1_{| \sigma _\tau , \beta ^1_\tau \rangle }} \min _{\mu ^2\in M^2_{| \sigma _\tau \rangle }} \left[ r(\sigma _\tau , \beta ^1_\tau , \beta ^2_\tau (\mu ^2)) + \gamma V_{\tau +1}(T(\sigma _\tau ,\beta ^1_\tau , \beta ^2_\tau (\mu ^2)), \mu ^1, \mu ^2 ) \right] \end{aligned}$$
(C6)

(merging both max operators; explanations follow the derivation:)

$$\begin{aligned}&= \max _{\mu ^1\in M^1_{| \sigma _\tau \rangle }} \min _{\mu ^2\in M^2_{| \sigma _\tau \rangle }} \left[ r(\sigma _\tau , \beta ^1_\tau (\mu ^1), \beta ^2_\tau (\mu ^2)) + \gamma V_{\tau +1}(T(\sigma _\tau ,\beta ^1_\tau (\mu ^1), \beta ^2_\tau (\mu ^2)), \mu ^1, \mu ^2 ) \right] \end{aligned}$$
(C7)

(again with the equivalence property discussed before the lemma:)

$$\begin{aligned}&= \max _{\mu ^1\in M^1_{| \sigma _\tau \rangle }} \min _{\mu ^2\in M^2_{| \sigma _\tau \rangle }} V_\tau (\sigma _\tau , \mu ^1, \mu ^2) \\&= \max _{\beta ^1_{\tau :H-1 | \sigma _\tau \rangle }} \min _{\beta ^2_{\tau :H-1 | \sigma _\tau \rangle }} V_\tau (\sigma _\tau , \beta ^1_{\tau :H-1}, \beta ^2_{\tau :H-1}) \\&{\mathop {=}\limits ^{\tiny {\text {def}}}}V^*_\tau (\sigma _\tau ). \end{aligned}$$

More precisely, line C5 (and, similarly, line C7) is obtained by observing that

  • minimizing over both (i) \(\beta ^2_\tau \) and (ii) \(\mu ^2\) constrained by \(\sigma _\tau \) and \({\varvec{\beta }}_\tau \) is equivalent to minimizing over \(\mu ^2\) constrained by \(\sigma _\tau \) and \(\beta ^1_\tau \); and

  • in the remainder of the formula, decision rule \(\beta ^2_\tau \) at time \(\tau \) can be retrieved as a function of \(\mu ^2\) (noted \(\beta ^2_\tau (\mu ^2)\)).

Also, line C6 results from the observation that, while \(M^1_{| \sigma _\tau , \beta ^1_\tau \rangle }\) and \(M^2_{| \sigma _\tau \rangle }\) actually allow making decisions over different time intervals, we are here minimizing over \(\mu ^2\) while maximizing over \(\mu ^1\) a function that is linear in both input spaces.

This amounts to solving some 2-player zero-sum normal-form game, hence the applicability of von Neumann’s minimax theorem.

The above derivation tells us that the maximin value (the best outcome player 1 can guarantee whatever player 2’s strategy) in the one-time-step game is thus the Nash equilibrium value (NEV) for the complete subgame from \(\tau \) onwards. \(\square \)
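The minimax-theorem argument invoked repeatedly above can be illustrated numerically. The following sketch (ours, not the paper's; the game matrix and function names are arbitrary) solves a small zero-sum normal-form game as a linear program with SciPy and checks that the maximin and minimax values coincide:

```python
# Illustration of von Neumann's minimax theorem on a small zero-sum
# normal-form game, solved as an LP.  Row player maximizes, column
# player minimizes; both values coincide (here, 0 for rock-paper-scissors).
import numpy as np
from scipy.optimize import linprog

def game_value(M):
    """Value and optimal mixed strategy of max_x min_y x^T M y (x, y in simplices)."""
    m, n = M.shape
    # Variables: (v, x_1..x_m); maximize v s.t. v <= (x^T M)_j for every column j.
    c = np.zeros(m + 1); c[0] = -1.0              # linprog minimizes, so use -v
    A_ub = np.hstack([np.ones((n, 1)), -M.T])     # v - (M^T x)_j <= 0
    b_ub = np.zeros(n)
    A_eq = np.hstack([[[0.0]], np.ones((1, m))])  # sum_i x_i = 1
    b_eq = [1.0]
    bounds = [(None, None)] + [(0, None)] * m
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[0], res.x[1:]

M = np.array([[0., -1., 1.], [1., 0., -1.], [-1., 1., 0.]])  # rock-paper-scissors
v_row, x = game_value(M)      # row player's maximin value
v_col, y = game_value(-M.T)   # column player's maximin in the transposed game
print(v_row, -v_col)          # both ~ 0: maximin = minimax
```

The same LP structure (value variable plus simplex constraints) underlies the constrained normal-form games manipulated in the proof.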

Appendix D Solving zs-OMGs

1.1 D.1 Preliminary Properties

1.1.1 D.1.1 Properties of \(T^{1}_m\) and \(T^{1}_c\)

The first two lemmas below present properties of \(T^{1}_m\) and \(T^{1}_c\) that will be useful afterward.

Lemma 3

\(T^{1}_m(\sigma _\tau , {\varvec{\beta }}_\tau )\) is linear in \(\sigma _\tau \), \(\beta ^1_\tau \), and \(\beta ^2_\tau \).

Proof

$$\begin{aligned}&T^{1}_m(\sigma _\tau ,{\varvec{\beta }}_\tau )(\theta ^1_\tau ,a^1,z^1) \nonumber \\&= \sum _{\theta ^2_\tau , a^2, z^2} T(\sigma _\tau , {\varvec{\beta }}_\tau )({ ( \theta ^1_\tau , a^1, z^1 ), ( \theta ^2_\tau , a^2, z^2 ) }) \qquad (\text {from Eq.}\, (1)) \end{aligned}$$
(D8)
$$\begin{aligned}&= \sum _{s',\theta ^2_\tau ,a^2,z^2} \beta ^1_\tau (\theta ^1_\tau ,a^1) \beta ^2_\tau (\theta ^2_\tau ,a^2) \sum _{s} P^{z^1,z^2}_{a^1,a^2}(s'|s) b(s|\theta ^1_\tau ,\theta ^2_\tau ) \sigma _\tau (\theta ^1_\tau ,\theta ^2_\tau )\nonumber \\&= \beta ^1_\tau (\theta ^1_\tau ,a^1) \sum _{\theta ^2_\tau ,a^2} \beta ^2_\tau (\theta ^2_\tau ,a^2) \sum _{s,s',z^2} P^{z^1,z^2}_{a^1,a^2}(s'|s) b(s|\theta ^1_\tau ,\theta ^2_\tau ) \sigma _\tau (\theta ^1_\tau ,\theta ^2_\tau ). \end{aligned}$$
(D9)

\(\square \)

Lemma 4

\(T^{1}_c(\sigma _\tau , {\varvec{\beta }}_\tau )\) is independent of \(\beta ^1_\tau \) and \(\sigma ^{m,1}_\tau \).

Proof

See [45], Lemma 4.2.3. \(\square \)

1.1.2 D.1.2 Linearity and Lipschitz-continuity of \(T (\sigma _\tau , \beta ^1_\tau , \beta ^2_\tau )\)

Lemma 5

At depth \(\tau \), \(T(\sigma _\tau ,{\varvec{\beta }}_\tau )\) is linear in \(\beta ^1_\tau \), \(\beta ^2_\tau \), and \(\sigma _\tau \), where \({\varvec{\beta }}_\tau =\langle \beta ^1_\tau , \beta ^2_\tau \rangle \). Moreover, it is 1-Lipschitz-continuous (1-LC) in \(\sigma _\tau \) (in 1-norm), i.e., for any \(\sigma _\tau \), \(\sigma '_\tau \):

$$\begin{aligned} \Vert T(\sigma '_\tau ,{\varvec{\beta }}_\tau ) - T(\sigma _\tau ,{\varvec{\beta }}_\tau )\Vert _1&\le 1 \cdot \Vert \sigma '_\tau - \sigma _\tau \Vert _1. \end{aligned}$$

Proof

Let \(\sigma \) be an occupancy state at time \(\tau \) and \({\varvec{\beta }}_\tau \) be a decision rule.

Then, as seen in the proof of Theorem 1, the next occupancy state \(\sigma ' = T(\sigma ,{\varvec{\beta }}_\tau )\) satisfies, for any \(({\varvec{\theta }},{\varvec{a}},{\varvec{z}})\):

$$\begin{aligned} \sigma '({\varvec{\theta }},{\varvec{a}},{\varvec{z}})&{\mathop {=}\limits ^{\tiny {\text {def}}}}Pr({\varvec{\theta }},{\varvec{a}},{\varvec{z}}| \sigma , \beta ^1_\tau , \beta ^2_\tau ) \\&= \beta ^1_\tau (\theta ^1, a^1) \beta ^2_\tau (\theta ^2, a^2) \left[ \sum _{s', s\in {{\mathcal {S}}}} P_{{\varvec{a}}}^{{\varvec{z}}}(s'|s) b(s| {\varvec{\theta }}) \right] \sigma ({\varvec{\theta }}) . \end{aligned}$$

Since \(b(s|{\varvec{\theta }})\) depends only on the model (transition function and initial belief), the next occupancy state \(\sigma '\) evolves linearly w.r.t. (i) the private decision rules \(\beta ^1_\tau \) and \(\beta ^2_\tau \), and (ii) the occupancy state \(\sigma \).

The 1-Lipschitz-continuity holds because each component of vector \(\sigma _\tau \) is distributed over multiple components of \(\sigma '\).

Indeed, let us view two occupancy states as vectors \({\varvec{x}},{\varvec{y}}\in {{\mathbb {R}}}^n\), and their corresponding next states under \({\varvec{\beta }}_\tau \) as \(M {\varvec{x}}\) and \(M {\varvec{y}}\), where \(M \in {{\mathbb {R}}}^{m\times n}\) is the corresponding transition matrix (i.e., which turns \(\sigma \) into \(\sigma ' {\mathop {=}\limits ^{\tiny {\text {def}}}}T(\sigma _\tau ,{\varvec{\beta }}_\tau )\)).

Then,

$$\begin{aligned} \Vert M{\varvec{x}}- M{\varvec{y}}\Vert _{1}&{\mathop {=}\limits ^{\tiny {\text {def}}}}\sum _{j=1}^m \ | \sum _{i=1}^n M_{i,j} (x_i - y_i) | \\&\le \sum _{j=1}^m \sum _{i=1}^n |M_{i,j} (x_i - y_i) |&(\text {triangle inequality}) \\&= \sum _{j=1}^m \sum _{i=1}^n M_{i,j} | x_i - y_i |&(\forall {i,j},\ M_{i,j}\ge 0) \\&= \sum _{i=1}^n \underbrace{\sum _{j=1}^m M_{i,j}}_{=1} | x_i - y_i |&(M \text { is a transition matrix}) \\&{\mathop {=}\limits ^{\tiny {\text {def}}}}\Vert {\varvec{x}}-{\varvec{y}}\Vert _1. \end{aligned}$$

\(\square \)
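The inequality chain above can be checked numerically. The following sketch (illustrative only; the random instances are arbitrary) verifies that a transition matrix, i.e., a non-negative matrix whose columns each sum to 1 under the usual \(({\varvec{M}}{\varvec{x}})_j = \sum _i M_{j,i} x_i\) convention, is 1-Lipschitz for the 1-norm:

```python
# Numerical check (illustrative, not part of the proof): a map x -> Mx with
# M column-stochastic is 1-Lipschitz in 1-norm, as shown in Lemma 5.
import numpy as np

rng = np.random.default_rng(0)
m, n = 6, 4
M = rng.random((m, n))
M /= M.sum(axis=0, keepdims=True)    # make each column sum to 1

for _ in range(100):
    x = rng.random(n); x /= x.sum()  # two occupancy-like distributions
    y = rng.random(n); y /= y.sum()
    lhs = np.abs(M @ x - M @ y).sum()
    rhs = np.abs(x - y).sum()
    assert lhs <= rhs + 1e-12        # ||Mx - My||_1 <= ||x - y||_1
print("1-Lipschitz bound verified on random instances")
```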

1.1.3 D.1.3 Lipschitz-Continuity of \(V^*\)

The next two results demonstrate that, in the finite horizon setting, \(V^*\) is Lipschitz-continuous (LC) in occupancy space, which allows defining LC upper- and lower-bound approximations.

Lemma 6

At depth \(\tau \), \(V_\tau (\sigma _\tau ,{\varvec{\beta }}_{\tau :})\) is linear w.r.t. \(\sigma _\tau \) and \({\varvec{\beta }}_{\tau :}\).

Note: This result in fact applies to the value function of any general-sum POSG with any number of agents (here N), e.g., to a Dec-POMDP. The following proof handles the general case (with \({\varvec{\beta }}_\tau {\mathop {=}\limits ^{\tiny {\text {def}}}}\langle \beta ^1_\tau , \dots , \beta ^N_\tau \rangle \) and \({\varvec{\beta }}_\tau ({\varvec{a}}|{\varvec{\theta }}) = \prod _{i=1}^N \beta ^i_\tau (a^i,\theta ^i)\)).

Proof

This property trivially holds for \(\tau =H-1\) because

$$\begin{aligned} V_{H-1}(\sigma _{H-1},{\varvec{\beta }}_{H-1:})&= r(\sigma _{H-1},{\varvec{\beta }}_{H-1}) \\&= \sum _{s,{\varvec{a}}} \left( \sum _{\varvec{\theta }}Pr(s,{\varvec{a}}|{\varvec{\theta }}) \sigma _{H-1}({\varvec{\theta }}) \right) r(s,{\varvec{a}}) \\&= \sum _{s,{\varvec{a}}} \left( \sum _{\varvec{\theta }}b(s|{\varvec{\theta }}) {\varvec{\beta }}_\tau ({\varvec{a}}|{\varvec{\theta }}) \sigma _{H-1}({\varvec{\theta }}) \right) r(s,{\varvec{a}}) \\&= \sum _{s,{\varvec{\theta }}} b(s | {\varvec{\theta }}) \sigma _{H-1}({\varvec{\theta }}) \left( \sum _{{\varvec{a}}} {\varvec{\beta }}_\tau ({\varvec{a}}|{\varvec{\theta }}) r(s,{\varvec{a}}) \right) . \end{aligned}$$

Now, let us assume that the property holds for \(\tau +1\).

Then,

$$\begin{aligned} V_\tau (\sigma _\tau ,{\varvec{\beta }}_{\tau :})&= \sum _{s,{\varvec{a}}} \Big ( \sum _{\varvec{\theta }}b(s| {\varvec{\theta }}) {\varvec{\beta }}_\tau ({\varvec{a}}|{\varvec{\theta }}) \sigma _\tau ({\varvec{\theta }}) \Big ) r(s, {\varvec{a}}) + \gamma V_{\tau +1}\left( T (\sigma _\tau , {\varvec{\beta }}_\tau ), {\varvec{\beta }}_{\tau +1:} \right) \\&= \sum _{s,{\varvec{\theta }}} b(s| {\varvec{\theta }}) \sigma _\tau ({\varvec{\theta }}) \Big ( \sum _{{\varvec{a}}} {\varvec{\beta }}_\tau ({\varvec{a}}|{\varvec{\theta }}) r(s, {\varvec{a}}) \Big ) + \gamma V_{\tau +1}\left( T (\sigma _\tau , {\varvec{\beta }}_\tau ), {\varvec{\beta }}_{\tau +1:} \right) . \end{aligned}$$

As

  • \(T(\sigma _\tau ,{\varvec{\beta }}_\tau )\) is linear in \(\sigma _\tau \) (Lemma 5) and

  • \(V_{\tau +1}(\sigma _{\tau +1}, {\varvec{\beta }}_{\tau +1:})\) is linear in \(\sigma _{\tau +1}\) (induction hypothesis),

their composition, \(V_{\tau +1} ( T (\sigma _\tau ,{\varvec{\beta }}_\tau ), {\varvec{\beta }}_{\tau +1:} )\), is also linear in \(\sigma _\tau \), and so is \(V_\tau (\sigma _\tau ,{\varvec{\beta }}_{\tau :})\). Similarly, \(V_\tau (\sigma _\tau ,{\varvec{\beta }}_\tau )\) is linear in \({\varvec{\beta }}_\tau \) for any \(\sigma _\tau \). \(\square \)

Theorem 5

(originally stated on page 9)

Let \(h_{\tau } {\mathop {=}\limits ^{\tiny {\text {def}}}}\frac{1-\gamma ^{H-\tau }}{1-\gamma }\) (or \(h_{\tau } {\mathop {=}\limits ^{\tiny {\text {def}}}}H-\tau \) if \(\gamma =1\)).

Then \(V^*_\tau (\sigma _\tau )\) is \(\lambda _{\tau }\)-Lipschitz continuous in \(\sigma _\tau \) at any depth \(\tau \in \{0, \dots , H-1\}\), where \(\lambda _{\tau } = \frac{1}{2} h_{\tau } \left( r_{\max } - r_{\min } \right) \).

Proof

At depth \(\tau \), the value of any behavioral strategy \({\varvec{\beta }}_{\tau :}\) is bounded, independently of \(\sigma _\tau \), by

$$\begin{aligned} V^{\max }_\tau&{\mathop {=}\limits ^{\tiny {\text {def}}}}h_{\tau } r_{\max }, \quad \text {where } r_{\max } {\mathop {=}\limits ^{\tiny {\text {def}}}}\max _{s,{\varvec{a}}}r(s,{\varvec{a}}), \text { and } \\ V^{\min }_\tau&{\mathop {=}\limits ^{\tiny {\text {def}}}}h_{\tau } r_{\min }, \quad \text {where } r_{\min } {\mathop {=}\limits ^{\tiny {\text {def}}}}\min _{s,{\varvec{a}}}r(s,{\varvec{a}}). \end{aligned}$$

Thus, \(V_{{\varvec{\beta }}_{\tau :}}\) being a linear function defined over a probability simplex (\({{\mathcal {O}}}^\sigma _\tau \)) (cf. Appendix 1) and bounded by \([V^{\min }_\tau ,V^{\max }_\tau ]\), we can apply Lemma 3.5 (p. 33) of Horák's PhD thesis [20] to establish that it is also \(\lambda _{\tau }\)-LC, i.e.,

$$\begin{aligned} |V_{{\varvec{\beta }}_{\tau :}}(\sigma ) -V_{{\varvec{\beta }}_{\tau :}}(\sigma ')|&\le \lambda _{\tau } \Vert \sigma -\sigma '\Vert _{1} \quad (\forall \sigma , \sigma '), \\ \text {with } \lambda _{\tau }&= \frac{V^{\max }_{\tau }-V^{\min }_{\tau }}{2}. \end{aligned}$$

Considering now optimal solutions, this means that, at depth \(\tau \) and for any \((\sigma ,\sigma ') \in {{\mathcal {O}}}^\sigma _\tau \):

$$\begin{aligned}&V^*_\tau (\sigma ) - V^*_\tau (\sigma ') \\&\quad = \max _{\beta ^1_{\tau :}} \min _{\beta ^2_{\tau :}} V_\tau (\sigma , \beta ^1_{\tau :}, \beta ^2_{\tau :}) - \max _{\beta '^1_{\tau :}} \min _{\beta '^2_{\tau :}} V_\tau (\sigma ', \beta '^1_{\tau :}, \beta '^2_{\tau :}) \\&\quad \le \max _{\beta ^1_{\tau :}} \min _{\beta ^2_{\tau :}} \left[ V_\tau (\sigma ', \beta ^1_{\tau :}, \beta ^2_{\tau :}) + \lambda _{\tau } \Vert \sigma -\sigma '\Vert _{1} \right] - \max _{\beta '^1_{\tau :}} \min _{\beta '^2_{\tau :}} V_\tau (\sigma ', \beta '^1_{\tau :}, \beta '^2_{\tau :}) \\&\quad = \lambda _{\tau } \Vert \sigma -\sigma '\Vert _{1}. \end{aligned}$$

Symmetrically, \(V^*_\tau (\sigma ) - V^*_\tau (\sigma ') \ge -\lambda _{\tau } \Vert \sigma -\sigma '\Vert _{1}\), hence the expected result:

$$\begin{aligned} | V^*_\tau (\sigma ) - V^*_\tau (\sigma ') |&\le \lambda _{\tau } \Vert \sigma -\sigma '\Vert _{1}. \end{aligned}$$

\(\square \)
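As a minimal sketch of how the constants of Theorem 5 are computed (the function names are ours, not the paper's):

```python
# Illustrative helpers: the depth-dependent Lipschitz constant of Theorem 5,
# lambda_tau = (1/2) * h_tau * (r_max - r_min), where h_tau is the
# (discounted) number of remaining steps.
def horizon_factor(tau, H, gamma):
    """h_tau = (1 - gamma^(H - tau)) / (1 - gamma), or H - tau if gamma == 1."""
    steps = H - tau
    if gamma == 1.0:
        return float(steps)
    return (1.0 - gamma**steps) / (1.0 - gamma)

def lipschitz_constant(tau, H, gamma, r_min, r_max):
    return 0.5 * horizon_factor(tau, H, gamma) * (r_max - r_min)

# Example: H = 10, gamma = 0.9, rewards in [-1, 1].
print(lipschitz_constant(0, 10, 0.9, -1.0, 1.0))   # ~6.5132 at the root
print(lipschitz_constant(9, 10, 0.9, -1.0, 1.0))   # 1.0 with one step left
```

As expected, the constant shrinks with the depth \(\tau \), which is what makes the bounds of Appendix D.2 tighter near the horizon.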

As it will be needed later, we also state the following lemma.

Lemma 7

Let us consider \(\sigma ^{c,1}_\tau \), \(\theta ^1_\tau \), and \(\psi ^2_\tau \).

Then \(\nu ^2_{[\sigma ^{c,1}_{\tau }, \psi ^2_{\tau }]}(\theta ^1_{\tau })\) is \(\lambda _\tau \)-LC in \(\sigma ^{c,1}_\tau (\cdot |\theta ^1_\tau )\).

Equivalently, we will also write that \(\nu ^2_{[\sigma ^{c,1}_{\tau }, \psi ^2_{\tau }]}\) is \(\lambda _\tau \)-LC in \(\sigma ^{c,1}_\tau \) in vector-wise 1-norm, i.e.:

$$\begin{aligned} \left| \nu ^2_{[\sigma ^{c,1}_{\tau }, \psi ^2_{\tau }]} - \nu ^2_{[{\tilde{\sigma }}^{c,1}_{\tau }, \psi ^2_{\tau }]} \right|&\le \lambda _\tau \Vert \sigma ^{c,1}_\tau - {\tilde{\sigma }}^{c,1}_\tau \Vert _1, \end{aligned}$$

where (i) the absolute value of a vector is obtained by taking the absolute value of each component and (ii) the vector-wise 1-norm of a matrix is a vector made of the 1-norm of each of its component vectors.

Proof

For any \(\theta ^1_\tau \), \(\sigma ^{c,1}_{\tau }\) and \(\psi ^2_{\tau }\) induce a POMDP for Player 1 from \(\tau \) on, where (i) the state at any time \(t \ge \tau \) corresponds to a pair \(\langle s, \theta ^2_t \rangle \), and (ii) the initial belief is derived from \(\sigma ^{c,1}_{\tau }(\cdot |\theta ^1_\tau )\).

The belief at time \(t\) is thus given by:

$$\begin{aligned} b_{\theta ^1_t}(s, \theta ^2_t)&{\mathop {=}\limits ^{\tiny {\text {def}}}}Pr(s, \theta ^2_t | \theta ^1_t) = \underbrace{Pr(s | \theta ^2_t, \theta ^1_t)}_{b^{\textsc {hmm}}_{\theta ^2_t, \theta ^1_t}(s)} \cdot \underbrace{Pr(\theta ^2_t | \theta ^1_t)}_{\sigma ^{c,1}_t(\theta ^2_t|\theta ^1_t)}. \end{aligned}$$

So,

  • the value function of any behavioral strategy \(\beta ^1_{\tau :}\) is linear at t in \(b_{\theta ^1_t}\), thus (in particular) in \(\sigma ^{c,1}_t(\cdot |\theta ^1_t)\); and

  • the optimal value function is LC at t also in \(b_{\theta ^1_t}\) (with the same depth-dependent upper-bounding Lipschitz constant \(\lambda _t\) as in the proof of Theorem 5),Footnote 6 thus (in particular) in \(\sigma ^{c,1}_t(\cdot |\theta ^1_t)\).

Using \(t=\tau \), the optimal value function is \(\nu ^2_{[\sigma ^{c,1}_{\tau }, \psi ^2_{\tau }]}(\theta ^1_{\tau })\), which is thus \(\lambda _\tau \)-LC in \(\sigma ^{c,1}_{\tau }(\cdot |\theta ^1_\tau )\). \(\square \)

1.2 D.2 Bounding Approximations of \(V^*\), \(W^{1,*}\) and \(W^{2,*}\)

1.2.1 D.2.1 \(\overline{V}_\tau \) and \(\underline{V}_\tau \)

To find a form that could be appropriate for an upper bound approximation of \(V^*_\tau \), let us consider an occupancy state \(\sigma _\tau \) and a single tuple \(\langle { {\tilde{\sigma }}_\tau , \nu ^2_{[{\tilde{\sigma }}_\tau ^{c,1}, \beta ^2_{\tau :}]} } \rangle \), and define \(\zeta _\tau {\mathop {=}\limits ^{\tiny {\text {def}}}}\sigma _\tau ^{m,1}{\tilde{\sigma }}_\tau ^{c,1}\). Then,

$$\begin{aligned} V^*(\sigma _\tau )&\le V^*(\zeta _\tau ) + \lambda _{\tau } \Vert \sigma _\tau - \zeta _\tau \Vert _1&(\text {LC}, { cf.}\, \text {Theorem 5}) \\&= V^*(\sigma _\tau ^{m,1} {\tilde{\sigma }}_\tau ^{c,1}) + \lambda _{\tau } \Vert \sigma _\tau - \zeta _\tau \Vert _1 \\&\le \sigma _\tau ^{m,1} \cdot \nu ^2_{[{\tilde{\sigma }}_\tau ^{c,1}, \beta ^2_{\tau :}]} + \lambda _{\tau } \Vert \sigma _\tau - \sigma _\tau ^{m,1}{\tilde{\sigma }}_\tau ^{c,1} \Vert _1.&(\text {Cvx}, { cf.} \, \text {Theorem 2}) \end{aligned}$$

Notes:

  • \({\tilde{\sigma }}^{m,1}_\tau \) does not appear in the resulting upper bound, thus will not need to be specified.

  • For \(\tau =H-1\), \(\nu ^2_{[{\tilde{\sigma }}_\tau ^{c,1}, \beta _{\tau :}^2]}\) is a simple function of r, \({\tilde{\sigma }}_\tau ^{c,1}\), \(\beta _{\tau :}^2\), and the dynamics of the system, as described in Eq. (9) of [47].

From this, we can deduce the following appropriate forms of upper and (symmetrically) lower bound function approximations for \(V^*_\tau \):

$$\begin{aligned} \overline{V}_\tau (\sigma _\tau )&= \min _{ \langle {\tilde{\sigma }}_\tau ^{c,1}, \langle \overline{\nu }^2_\tau , \beta _{\tau :}^2 \rangle \rangle \in \overline{{{\mathcal {J}}}}_\tau } \left[ \sigma _\tau ^{m,1} \cdot \overline{\nu }^2_\tau + \lambda _{\tau } \Vert \sigma _\tau - \sigma _\tau ^{m,1}{\tilde{\sigma }}_\tau ^{c,1} \Vert _1 \right] , \text { and} \end{aligned}$$
(D10)
$$\begin{aligned} \underline{V}_\tau (\sigma _\tau )&= \max _{ \langle {\tilde{\sigma }}_\tau ^{c,2}, \langle \underline{\nu }^1_\tau , \beta _{\tau :}^1 \rangle \rangle \in \underline{{{\mathcal {J}}}}_\tau } \left[ \sigma _\tau ^{m,2} \cdot \underline{\nu }^1_\tau - \lambda _{\tau } \Vert \sigma _\tau - \sigma _\tau ^{m,2}{\tilde{\sigma }}_\tau ^{c,2} \Vert _1 \right] , \end{aligned}$$
(D11)

which are, respectively, concave in \(\sigma ^{m,1}_\tau \) and convex in \(\sigma ^{m,2}_\tau \), and which both exploit the Lipschitz continuity.
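Assuming a hypothetical array encoding of occupancy states (a marginal over \(\Theta ^1_\tau \) times a conditional over \(\Theta ^2_\tau \)), the upper bound of Eq. (D10) can be evaluated as in the following sketch (our names and data layout, not the authors' implementation):

```python
# Sketch of evaluating the upper bound of Eq. (D10).  Each stored tuple
# holds a conditional term c1_tilde (a |Theta^2| x |Theta^1| column-stochastic
# matrix, hypothetical encoding) and a vector nu2 over Theta^1; sigma is
# factored as marginal m1 (over Theta^1) times conditional c1.
import numpy as np

def upper_bound_V(m1, c1, tuples, lam):
    """min over tuples of  m1 . nu2 + lam * || m1*c1 - m1*c1_tilde ||_1."""
    sigma = c1 * m1                       # joint occupancy, entry (theta2, theta1)
    best = np.inf
    for c1_tilde, nu2 in tuples:
        zeta = c1_tilde * m1              # sigma^{m,1} * tilde sigma^{c,1}
        val = m1 @ nu2 + lam * np.abs(sigma - zeta).sum()
        best = min(best, val)
    return best

m1 = np.array([0.5, 0.5])
c1 = np.array([[0.7, 0.2], [0.3, 0.8]])   # columns sum to 1
tuples = [(c1, np.array([1.0, 2.0])),     # matching conditional: no penalty
          (np.array([[0.5, 0.5], [0.5, 0.5]]), np.array([0.0, 0.0]))]
print(upper_bound_V(m1, c1, tuples, lam=1.0))  # -> 0.5 (second tuple wins)
```

The Lipschitz penalty lets a tuple recorded at a nearby occupancy state still contribute a valid bound at \(\sigma _\tau \), which is the sawtooth-like mechanism exploited by the solver.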

1.2.2 D.2.2 \({\overline{W}}_{\tau }\) and \({\underline{W}}_{\tau }\)

Note: We discuss all depths from 0 to \(H-1\), even though we do not need these approximations at \(\tau =H-1\).

Let us first see how concavity-convexity properties affect \(W_\tau ^{*,1}\).

Lemma 8

Considering that vectors \(\nu ^2_{[\sigma _{H}^{c,1},\beta _{H:}^2]}\) are null vectors, we have, for all \(\tau \in \{0, \dots , H-1\}\):

$$\begin{aligned} W_\tau ^{1,*}(\sigma _\tau ,\beta ^1_\tau )&= \min _{\beta ^2_\tau , \langle \beta _{\tau +1:}^2, \nu ^2_{{[T^{1}_c(\sigma _\tau , \beta ^2_\tau ), \beta ^2_{\tau +1:}]}} \rangle } \beta ^1_\tau \cdot \Big [ { r(\sigma _\tau , \cdot , \beta ^2_\tau ) } \\&\qquad + {\gamma T^{1}_m(\sigma _\tau , \cdot , \beta ^2_\tau ) \cdot \nu ^2_{{[T^{1}_c(\sigma _\tau , \beta ^2_\tau ),\beta ^2_{\tau +1:}]}} } \Big ]. \end{aligned}$$

Proof

Considering that vectors \(\nu ^2_{[\sigma _{H}^{c,1},\beta ^2_{H:}]}\) are null vectors, we have, for all \(\tau \in \{0, \dots , H-1\}\):

$$\begin{aligned} W_\tau ^{1,*}(\sigma _\tau ,\beta ^1_\tau )&= \min _{\beta ^2_\tau } Q^*_\tau (\sigma _\tau , \beta ^1_\tau , \beta ^2_\tau ) = \min _{\beta ^2_\tau } \left[ r(\sigma _\tau , {\varvec{\beta }}_\tau ) + \gamma V^*_{\tau +1}( T(\sigma _\tau , {\varvec{\beta }}_\tau ) ) \right] \end{aligned}$$

(Line below exploits Theorem 2 (p. 8) and \(T^{1}_c\)’s independence from \(\beta ^1_\tau \) (Lemma 4).)

$$\begin{aligned}&= \min _{\beta ^2_\tau } \left[ r(\sigma _\tau , {\varvec{\beta }}_\tau ) + \gamma \min _{\langle \beta _{\tau +1:}^2, \nu ^2_{{[T^{1}_c(\sigma _\tau , \beta ^2_\tau ), \beta ^2_{\tau +1:}]}} \rangle } \left[ T^{1}_m(\sigma _\tau , {\varvec{\beta }}_\tau ) \cdot \nu ^2_{{[T^{1}_c(\sigma _\tau , \beta ^2_\tau ), \beta ^2_{\tau +1:}]}} \right] \right] \\&\quad = \min _{\beta ^2_\tau , \langle \beta _{\tau +1:}^2, \nu ^2_{{[T^{1}_c(\sigma _\tau , \beta ^2_\tau ), \beta ^2_{\tau +1:}]}} \rangle } \left[ r(\sigma _\tau , {\varvec{\beta }}_\tau ) + \gamma T^{1}_m(\sigma _\tau , {\varvec{\beta }}_\tau ) \cdot \nu ^2_{{[T^{1}_c(\sigma _\tau , \beta ^2_\tau ), \beta ^2_{\tau +1:}]}} \right] \end{aligned}$$

(Line below exploits r and \(T^{1}_m\)’s linearity in \(\beta ^1_\tau \) (Lemma 3).)

$$\begin{aligned}&= \min _{\beta ^2_\tau , \langle \beta _{\tau +1:}^2, \nu ^2_{{[T^{1}_c(\sigma _\tau , \beta ^2_\tau ), \beta ^2_{\tau +1:}]}} \rangle } {\beta ^1_\tau }^{\top } \! \cdot \Big [ { r(\sigma _\tau , \cdot , \beta ^2_\tau ) }\\&\quad + \gamma T^{1}_m(\sigma _\tau , \cdot , \beta ^2_\tau ) \cdot \nu ^2_{{[T^{1}_c(\sigma _\tau , \beta ^2_\tau ),\beta ^2_{\tau +1:}]}} \Big ]. \end{aligned}$$

\(\square \)

Note that, since \(V^*_H=0\), \(\tau =H-1\) is a particular case which can be simply re-written:

$$\begin{aligned} W_\tau ^{1,*}(\sigma _\tau ,\beta ^1_\tau )&= \min _{\beta ^2_\tau } {\beta ^1_\tau }^{\top } \! \cdot r(\sigma _\tau , \cdot , \beta ^2_\tau ). \end{aligned}$$

To find a form that could be appropriate for an upper bound approximation of \(W^{*,1}_\tau \), let us now consider an occupancy state \(\sigma _\tau \) and a single tuple \(\langle { {\tilde{\sigma }}_\tau , {\tilde{\beta }}^2_\tau , \nu ^2_{[T^{1}_c({\tilde{\sigma }}_\tau , {\tilde{\beta }}^2_\tau ), {\tilde{\beta }}_{\tau +1:}^2]} } \rangle \). Then,

$$\begin{aligned} W_\tau ^{1,*}(\sigma _\tau ,\beta ^1_\tau )&= \min _{\beta ^2_\tau } \left[ r(\sigma _\tau , {\varvec{\beta }}_\tau ) + \gamma V^*_{\tau +1}( T(\sigma _\tau , {\varvec{\beta }}_\tau ) ) \right] \\&\le r(\sigma _\tau , \beta ^1_\tau , {\tilde{\beta }}^2_\tau ) + \gamma V^{BR,1}_{\tau +1}( T(\sigma _\tau , \beta ^1_\tau , {\tilde{\beta }}^2_\tau ) | {\tilde{\beta }}_{\tau +1:}^2) \end{aligned}$$

(Use \({\tilde{\beta }}^2_\tau \) & \({\tilde{\beta }}^2_{\tau +1:}\) instead of mins)

(where \(V^{BR,1}_{\tau +1}( T(\sigma _\tau , \beta ^1_\tau , {\tilde{\beta }}^2_\tau ) | {\tilde{\beta }}_{\tau +1:}^2)\) is the value of 1’s best response to \({\tilde{\beta }}^2_{\tau +1:}\) if in \(T(\sigma _\tau , \beta ^1_\tau , {\tilde{\beta }}^2_\tau )\))

$$\begin{aligned}&= r(\sigma _\tau , \beta ^1_\tau , {\tilde{\beta }}^2_\tau ) + \gamma T^{1}_m(\sigma _\tau , \beta ^1_\tau , {\tilde{\beta }}^2_\tau ) \cdot \nu ^2_{[T^{1}_c(\sigma ^{c,1}_\tau , {\tilde{\beta }}^2_\tau ), {\tilde{\beta }}_{\tau +1:}^2]} \end{aligned}$$

(Lemma 3 of [47])

(D12)

$$\begin{aligned}&\le r(\sigma _\tau , \beta ^1_\tau , {\tilde{\beta }}^2_\tau ) + \gamma T^{1}_m(\sigma _\tau , \beta ^1_\tau , {\tilde{\beta }}^2_\tau ) \cdot \nu ^2_{[T^{1}_c({\tilde{\sigma }}^{c,1}_\tau , {\tilde{\beta }}^2_\tau ), {\tilde{\beta }}_{\tau +1:}^2]} \\&\quad + \gamma \lambda _{\tau +1} \cdot \Vert T(\sigma _\tau , \beta ^1_\tau , {\tilde{\beta }}^2_\tau ) - T^{1}_m(\sigma _\tau , \beta ^1_\tau , {\tilde{\beta }}^2_\tau ) T^{1}_c({\tilde{\sigma }}^{c,1}_\tau , {\tilde{\beta }}^2_\tau ) \Vert _1 \end{aligned}$$

(Lemma 7: \(\lambda _{\tau +1}\)-LC of \(\nu ^2_{[T^{1}_c(\sigma ^{c,1}_\tau , {\tilde{\beta }}^2_\tau ), {\tilde{\beta }}_{\tau +1:}^2]}\))

(D13)

$$\begin{aligned}&= {\beta ^1_\tau }^{\top } \! \cdot \Big [ r(\sigma _\tau , \cdot , {\tilde{\beta }}^2_\tau ) + \gamma T^{1}_m(\sigma _\tau , \cdot , {\tilde{\beta }}^2_\tau ) \cdot { \nu ^2_{[T^{1}_c({\tilde{\sigma }}^{c,1}_\tau , {\tilde{\beta }}^2_\tau ), {\tilde{\beta }}_{\tau +1:}^2]} }\\&\quad { + \gamma \lambda _{\tau +1} \cdot \Vert T(\sigma _\tau , \cdot , {\tilde{\beta }}^2_\tau ) - T^{1}_m(\sigma _\tau , \cdot , {\tilde{\beta }}^2_\tau ) T^{1}_c({\tilde{\sigma }}^{c,1}_\tau , {\tilde{\beta }}^2_\tau ) \Vert _1 } \Big ] \end{aligned}$$

(Linearity in \(\beta ^1_\tau \); alternative writing)

From this, we can deduce the following appropriate forms of (i) an upper-bound approximation for \(W^{1,*}_\tau \) and (ii), symmetrically, a lower-bound approximation for \(W^{2,*}_\tau \):

$$\begin{aligned} {\overline{W}}_{\tau }(\sigma _\tau , \beta ^1_\tau )&= \min _{ \langle {\tilde{\sigma }}^{c,1}_\tau , \beta ^2_\tau , \overline{\nu }^2_{\tau +1} \rangle \in \overline{\mathcal {I}}_\tau } {\beta ^1_\tau }^{\top } \cdot \Big [ { r(\sigma _\tau , \cdot , \beta ^2_\tau ) + \gamma T^{1}_m(\sigma _\tau , \cdot , \beta ^2_\tau ) \cdot \overline{\nu }^2_{\tau +1} } \\&\quad + \gamma \lambda _{\tau +1} \cdot \Vert T(\sigma _\tau , \cdot , \beta ^2_\tau ) - T^{1}_m(\sigma _\tau , \cdot , \beta ^2_\tau ) T^{1}_c({\tilde{\sigma }}^{c,1}_\tau , \beta ^2_\tau ) \Vert _1 \Big ], \text { and} \\ {\underline{W}}_{\tau }(\sigma _\tau , \beta ^2_\tau )&= \max _{ \langle {\tilde{\sigma }}^{c,2}_\tau , \beta ^1_\tau , \underline{\nu }^1_{\tau +1} \rangle \in \underline{\mathcal {I}}_\tau } {\beta ^2_\tau }^{\top } \cdot \Big [ { r(\sigma _\tau , \beta ^1_\tau , \cdot ) + \gamma T^{2}_m(\sigma _\tau , \beta ^1_\tau , \cdot ) \cdot \underline{\nu }^1_{\tau +1} } \\&\quad - \gamma \lambda _{\tau +1} \cdot \Vert T(\sigma _\tau , \beta ^1_\tau , \cdot ) - T^{2}_m(\sigma _\tau , \beta ^1_\tau , \cdot ) T^{2}_c({\tilde{\sigma }}^{c,2}_\tau , \beta ^1_\tau ) \Vert _1 \Big ], \end{aligned}$$

where \(\overline{\nu }^2_{\tau +1}\) and \(\underline{\nu }^1_{\tau +1}\), respectively, upper and lower bound the actual vectors associated to the players’ future strategies (resp. of 2 and 1).

Again, \(\tau =H-1\) is a particular case where only the reward term is preserved.

This constitutes the proof of the following proposition.

Proposition 13

(originally stated on page 11) Let \(\overline{{{\mathcal {I}}}}_\tau \) be a set of tuples \(w=\langle {\tilde{\sigma }}_\tau , \beta _{\tau }^2, \langle \overline{\nu }_{\tau +1}^2,\) \(\beta _{\tau +1:}^2 \rangle \rangle \). Then,

$$\begin{aligned} {\overline{W}}_{\tau }(\sigma _\tau , \beta _\tau ^1)&{\mathop {=}\limits ^{\tiny {\text {def}}}}\min _{\langle {\tilde{\sigma }}_\tau , \beta _{\tau }^2, \langle \overline{\nu }_{\tau +1}^2, \beta _{\tau +1:}^2 \rangle \rangle \in \overline{{{\mathcal {I}}}}_\tau } \left[ r(\sigma _\tau , \beta _\tau ^1, \beta ^2_\tau ) + \gamma T^{1}_m(\sigma _\tau , \beta _\tau ^1, \beta ^2_\tau ) \cdot \overline{\nu }^2_{\tau +1} \right. \nonumber \\&\quad \left. + \gamma \lambda _{\tau +1} \Vert T(\sigma _\tau , \beta _\tau ^1, \beta ^2_\tau ) - T^{1}_m(\sigma _\tau , \beta _\tau ^1, \beta ^2_\tau ) T^{1}_c({\tilde{\sigma }}^{c,1}_\tau , \beta ^2_\tau ) \Vert _1 \right] \end{aligned}$$
(6)

upper-bounds \(W_\tau ^{1,*}\) over the whole space \({{\mathcal {O}}}^\sigma _\tau \times {{\mathcal {B}}}_{\tau }^1\).

1.3 D.3 Related Operators

1.3.1 D.3.1 Selection Operator: Solving for \(\beta ^1_\tau \) as an LP

Proposition 14

Using now a distribution \(\psi ^2_\tau \) over tuples \(w = \langle {\tilde{\sigma }}^{c,1}_\tau , \beta ^2_\tau , \overline{\nu }^2_{\tau +1} \rangle \in \overline{\mathcal {I}}^1_\tau \), the corresponding upper-bounding value for “profile” \(\langle \beta ^1_\tau , \psi ^2_\tau \rangle \) when in \(\sigma _\tau \) can be written as an expectation:

$$\begin{aligned} {\beta ^1_\tau }^{\top } \cdot M^{\sigma _\tau } \cdot \psi ^2_\tau , \end{aligned}$$

where \(M^{\sigma _\tau }\) is an \(|\Theta ^1_\tau \times {{\mathcal {A}}}^1| \times |\overline{\mathcal {I}}^1_\tau |\) matrix.

Proof

From the right-hand side term in Eq. (D13), the upper-bounding value associated to \(\sigma _\tau \), \(\beta ^1_\tau \) and a tuple \(w = \langle {\tilde{\sigma }}^{c,1}_\tau , \beta ^2_\tau , \overline{\nu }^2_{\tau +1} \rangle \in \overline{\mathcal {I}}^1_\tau \) is linear in \(\beta ^1_\tau \), and can thus be written \({\beta ^1_\tau }^{\top } \cdot M^{\sigma _\tau }_{(\cdot ,w)}\) for some vector \(M^{\sigma _\tau }_{(\cdot ,w)}\) indexed by pairs \((\theta ^1_\tau , a^1)\).

Using now a distribution \(\psi ^2_\tau \) over tuples \(w \in \overline{\mathcal {I}}^1_\tau \) (where \(x[w]\) denotes the field \(x\) of tuple \(w\)), the corresponding upper-bounding value for “profile” \(\langle \beta ^1_\tau , \psi ^2_\tau \rangle \) when in \(\sigma _\tau \) is obtained as the expectation

$$\begin{aligned} \sum _{w \in \overline{\mathcal {I}}^1_\tau } \psi ^2_\tau (w) \, {\beta ^1_\tau }^{\top } \cdot M^{\sigma _\tau }_{(\cdot ,w)}&= {\beta ^1_\tau }^{\top } \cdot M^{\sigma _\tau } \cdot \psi ^2_\tau , \end{aligned}$$

where \(M^{\sigma _\tau }\) is an \(|\Theta ^1_\tau \times {{\mathcal {A}}}^1| \times |\overline{\mathcal {I}}^1_\tau |\) matrix. \(\square \)

For implementation purposes, using Eqs. (2) and (D9) (to develop, respectively, \(r(\cdot ,\cdot ,\cdot )\) and \(T^{1}_m(\cdot ,\cdot ,\cdot )\)), we can derive the expression of each component of \(M^{\sigma _\tau }\), i.e., the upper-bounding value when \(a^1\) is applied in \(\theta ^1_\tau \) and \(w\) is chosen.

Then, solving \(\max _{\beta ^1_\tau } {\overline{W}}_{\tau }(\sigma _\tau , \beta ^1_\tau )\) can be rewritten as solving a zero-sum game where pure strategies are:

  • for Player 1, the choice of one action (among \(|{{\mathcal {A}}}^1|\)) for each of the \(|\Theta ^1_\tau |\) histories, rather than a single action, and,

  • for Player 2, the choice of 1 element of \(\overline{\mathcal {I}}^1_\tau \).

One can view it as a Bayesian game with one type per history \(\theta ^1_\tau \) for 1, and a single type for 2.

With our upper bound, \(\max _{\beta ^1_\tau } {\overline{W}}_{\tau }(\sigma _\tau ,\beta ^1_\tau )\) can thus be solved as the LP:

$$\begin{aligned}&\begin{array}{llll} \displaystyle \max _{\beta _\tau ^1,v} v \quad \text {s.t. } &{} \text {(i)} &{} \forall w \in \overline{\mathcal {I}}^1_\tau , &{} v \le {\beta _\tau ^1}^{\top } \! \cdot M^{\sigma _\tau }_{(\cdot ,w)} \\ &{} \text {(ii)} &{} \forall \theta _\tau ^1 \in \Theta _\tau ^1, &{} {\displaystyle \sum _{a^1}} \beta _\tau ^1(a^1|\theta _\tau ^1) = 1, \end{array} \end{aligned}$$

whose dual LP is given by

$$\begin{aligned}&\begin{array}{llll} \displaystyle \min _{\psi ^2_\tau ,v} v \quad \text {s.t. } &{} \text {(i)} &{} \forall (\theta ^1_\tau , a^1 ) \in \Theta ^1_\tau \times {{\mathcal {A}}}^1, &{} v \ge M^{\sigma _\tau }_{((\theta ^1_\tau ,a^1),\cdot )} \cdot \psi ^2_\tau \\ &{} \text {(ii)} &{} &{} {\displaystyle \sum _{w \in \overline{\mathcal {I}}^1_\tau }} \! \psi ^2_\tau ( w ) = 1. \end{array} \end{aligned}$$

As can be noted, \(M^{\sigma _\tau }\)'s rows corresponding to 0-probability histories \(\theta ^1_\tau \) in \(\sigma ^{m,1}_\tau \) are empty (full of zeros), so that the corresponding decision rules (for these histories) are not relevant and can be set arbitrarily. The actual implementation thus ignores these histories, whose corresponding decision rules also do not need to be stored.
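As a sketch (toy sizes, hypothetical data; `solve_primal` is our name) of how the primal LP above can be solved with an off-the-shelf solver:

```python
# Sketch of the primal LP: variables are v and beta(a^1 | theta^1), with one
# simplex constraint per history theta^1 and one constraint v <= beta . M[:, w]
# per tuple w of the bound set.  Sizes and the matrix M below are toy values.
import numpy as np
from scipy.optimize import linprog

def solve_primal(M, n_hist, n_act):
    """max_{beta, v} v  s.t.  v <= beta^T M[:, w]  and  beta(.|theta) in a simplex."""
    n_bw = n_hist * n_act                 # beta flattened over (theta, a) pairs
    n_w = M.shape[1]
    c = np.zeros(1 + n_bw); c[0] = -1.0               # linprog minimizes -v
    A_ub = np.hstack([np.ones((n_w, 1)), -M.T])       # v - beta . M[:, w] <= 0
    b_ub = np.zeros(n_w)
    A_eq = np.zeros((n_hist, 1 + n_bw)); b_eq = np.ones(n_hist)
    for th in range(n_hist):              # sum_a beta(a | theta) = 1
        A_eq[th, 1 + th * n_act: 1 + (th + 1) * n_act] = 1.0
    bounds = [(None, None)] + [(0, None)] * n_bw
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[0], res.x[1:].reshape(n_hist, n_act)

# Toy instance: 2 histories x 2 actions, 2 tuples in the bound set.
M = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.2, 0.2]])
v, beta = solve_primal(M, n_hist=2, n_act=2)
```

The dual variables of the `A_ub` constraints would correspond to the distribution \(\psi ^2_\tau \) over tuples used in the dual LP.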

Remark 2

(Interpretation of \(M^{\sigma _\tau }\)) The content of this matrix can be interpreted by noting that, since a given w contains a behavioral strategy \(\beta ^2_{\tau :H-1}\) and an occupancy state \({\tilde{\sigma }}_\tau \), a pair \(\langle w, \theta ^1_\tau \rangle \) induces a POMDP for player 1 whose state space is made of pairs \(\langle s, \theta ^2_t\rangle \), and whose initial belief \(b_\tau \) depends on \({\tilde{\sigma }}_\tau \) and \(\theta _\tau ^1\). Solving this POMDP amounts to finding a best response of player 1 to \(\beta _{\tau :H-1}^2\). In this setting, an element \(M^{\sigma _\tau }_{((\theta _\tau ^1,a^1),w)}\) is an upper bound on the optimal (POMDP) Q-value when player 1 performs \(a^1\) while facing \(b_\tau \) (\(Q^*_{\text {POMDP}}(b_\tau ,a^1)\)).

1.3.2 D.3.2 Upper Bounding \(\nu ^2_{[\sigma ^{c,1}_\tau ,\psi ^{2}_{\tau }]}\)

Adding a new complete tuple to \(\overline{\mathcal {I}}^1_\tau \) requires a new vector \(\overline{\nu }^2_{\tau }\) that upper bounds the vector \(\nu ^2_{[\sigma ^{c,1}_\tau ,\psi ^2_\tau ]}\) associated to the strategy induced by \(\psi ^2_\tau \). We can obtain one in a recursive manner (rather than by solving the induced POMDP).

Proposition 15

For each \(\psi ^2_\tau \) obtained as the solution of the aforementioned (dual) LP in \(\sigma _\tau \), and each \(\theta ^1_\tau \), \(\nu ^2_{[\sigma ^{c,1}_{\tau }, \psi ^2_{\tau }]}(\theta ^1_\tau )\) is upper-bounded by a value \(\overline{\nu }^2_{\tau }(\theta ^1_\tau )\) that depends on vectors \(\overline{\nu }^2_{\tau +1}\) in the support of \(\psi ^2_\tau \). In particular, if \(\theta ^1_\tau \in Supp (\sigma ^{m,1}_\tau )\), we have:

$$\begin{aligned} \overline{\nu }^2_{\tau }(\theta ^1_\tau )&{\mathop {=}\limits ^{\tiny {\text {def}}}}\frac{1}{\sigma ^{m,1}_{\tau }(\theta ^1_\tau )} \max _{a^1 \in {{\mathcal {A}}}^1} M^{\sigma _\tau }_{((\theta ^1_\tau , a^1), \cdot )} \cdot \psi ^2_\tau . \end{aligned}$$

Proof

For a newly derived \(\psi ^2_\tau \), as \(\nu ^2_{[\sigma ^{c,1}_\tau , \psi ^2_\tau ]}(\theta ^1_\tau )\) is the value of 1's best action (\(\in {{\mathcal {A}}}^1\)) when 1 (i) observes \(\theta ^1_\tau \) while in \(\sigma ^{c,1}_\tau \) and (ii) 2 plays \(\psi ^2_\tau \), we have:

$$\begin{aligned}&\nu ^2_{[\sigma ^{c,1}_\tau , \psi ^2_\tau ]}(\theta ^1_\tau ) {\mathop {=}\limits ^{\tiny {\text {def}}}}V^\star _{[\sigma ^{c,1}_\tau , \psi ^2_\tau ]}(\theta ^1_\tau ) \qquad \text {(optimal POMDP value function)}\\&\quad = \max _{\beta ^1_{\tau :}} {{\,\mathrm{\mathbb {E}}\,}}\left[ \sum _{t=\tau }^H \gamma ^{t-\tau } R_t \mid \beta ^1_{\tau :}, \theta ^1_\tau , \sigma ^{c,1}_\tau , \psi ^2_\tau \right] \\&\quad = \max _{a^1} {{\,\mathrm{\mathbb {E}}\,}}\left[ R_\tau + \gamma \max _{\beta ^1_{\tau +1:}} {{\,\mathrm{\mathbb {E}}\,}}\Bigg [ \sum _{t=\tau +1}^H \gamma ^{t-(\tau +1)} R_t \mid \beta ^1_{\tau +1:}, \langle \theta ^1_\tau , a^1, Z^1 \rangle , \sigma ^{c,1}_{\tau +1}, \psi ^2_{\tau +1} \right] \\&\qquad \Bigg \vert a^1, \theta ^1_\tau , \sigma ^{c,1}_\tau , \psi ^2_\tau \Bigg ] \\&\quad = \max _{a^1} {{\,\mathrm{\mathbb {E}}\,}}\left[ R_\tau + \gamma V^\star _{[\sigma ^{c,1}_{\tau +1}, \psi ^2_{\tau +1}]}( \theta ^1_\tau , a^1, Z^1 ) \mid a^1, \theta ^1_\tau , \sigma ^{c,1}_\tau , \psi ^2_\tau \right] \\&\quad = \max _{a^1} \sum _{w, \theta ^2_\tau , a^2, z^1} Pr(w, \theta ^2_\tau , z^1, a^2 \mid a^1, \theta ^1_\tau , \sigma ^{c,1}_\tau , \psi ^2_\tau ) \\&\qquad \cdot \left( r({\varvec{\theta }}_\tau ,{\varvec{a}}_\tau ) + \gamma \nu ^2_{[\sigma ^{c,1}_{\tau +1}, \psi ^2_{\tau +1}[w]]}( \theta ^1_\tau , a^1, z^1 ) \right) \end{aligned}$$

(where \(\sigma ^{c,1}_{\tau +1}=T^{1}_c(\sigma ^{c,1}_\tau , \beta ^2_\tau [w])\) (Lemma 4, p. 39)),

$$\begin{aligned}&= \max _{a^1} \sum _{w, \theta ^2_\tau , a^2, z^1} Pr(w \mid \psi ^2_\tau ) \cdot Pr(\theta ^2_\tau \mid \theta ^1_\tau , \sigma ^{c,1}_\tau ) \cdot Pr( a^2 \mid \beta ^2_\tau [w], \theta ^2_\tau ) \cdot Pr( z^1 \mid {\varvec{\theta }}_\tau , {\varvec{a}}_\tau ) \\&\quad \cdot \left( r({\varvec{\theta }}_\tau , {\varvec{a}}_\tau ) + \gamma \nu ^2_{[\sigma ^{c,1}_{\tau +1}, \psi ^2_{\tau +1}[w]]}(\theta ^1_\tau , a^1, z^1) \right) \\&= \max _{a^1} \sum _w \psi ^2_\tau (w) \sum _{\theta ^2_\tau } \sigma ^{c,1}_\tau (\theta ^2_\tau \mid \theta ^1_\tau ) \sum _{a^2} \beta ^2_\tau [w](a^2 \mid \theta ^2_\tau ) \\&\quad \cdot \left( r({\varvec{\theta }}_\tau , {\varvec{a}}_\tau ) + \gamma \sum _{z^1} Pr(z^1 \mid {\varvec{\theta }}_\tau ,{\varvec{a}}_\tau ) \, \nu ^2_{[\sigma ^{c,1}_{\tau +1}, \psi ^2_{\tau +1}[w]]}(\theta ^1_\tau , a^1, z^1) \right) \end{aligned}$$

then, as \(\nu ^2_{[\sigma ^{c,1}_{\tau +1}, \psi ^2_{\tau +1}[w]]}\) is \(\lambda _{\tau +1}\)-LC in (any) \(\sigma ^{c,1}_{\tau +1}\) (Lemma 7), upper-bounding each such term by the corresponding vector \(\overline{\nu }^2_{\tau +1}\) in the support of \(\psi ^2_\tau \) yields the announced bound \(\overline{\nu }^2_{\tau }(\theta ^1_\tau )\).

\(\square \)

D.3.3 Strategy Conversion

Firstly, we give details regarding solutions of Dual LPs (Eq. 7) inducing behavioral strategies. As suggested in Sect. 3.2.2, one can show by induction that for any timestep \(\tau \), each \(\psi _{\tau }^2\) is actually equivalent to an element of \(\Delta ({{\mathcal {B}}}_{\tau :}^2)\). The following lemma shows that for any timestep \(\tau \), each element of \(\Delta ({{\mathcal {B}}}_{\tau :}^2)\) induces an element of \({{\mathcal {B}}}_{\tau :}^2\).

Lemma 9

Each \(\psi _{\tau }^i \in \Delta ({{\mathcal {B}}}_{\tau :}^i)\) induces a behavioral strategy. More precisely, we prove that (i) there is a natural injection from the set \({{\mathcal {B}}}_{\tau :}^i\) into the set of distributions \(\Delta ({{\mathcal {B}}}_{\tau :}^i)\), and (ii) there is a surjection from \(\Delta ({{\mathcal {B}}}_{\tau :}^i)\) onto \({{\mathcal {B}}}_{\tau :}^i\).

Proof

By descending induction on \(\tau \in \{0,\dots ,H-1\}\), we prove that \(\cup _{t=\tau }^{H-1} {{\mathcal {B}}}_t^i \subset \cup _{t=\tau }^{H-1} \Delta ({{\mathcal {B}}}^i_t)\). Firstly, for \(\tau = H-1\) and any \( \beta _{H-1} \in {{\mathcal {B}}}_{H-1}\), one can pick the degenerate distribution \(\psi _{H-1} = \beta _{H-1}\), which is in \(\Delta ({{\mathcal {B}}}^i_{H-1})\). Next, assume that \(\cup _{t=\tau +1}^{H-1} {{\mathcal {B}}}_{t}^i \subset \cup _{t=\tau +1}^{H-1} \Delta ({{\mathcal {B}}}^i_{t})\) for some \(\tau \in \{0,\dots ,H-2\}\). Any \(\beta _{\tau :H-1}\) decomposes as \(\beta _\tau \oplus \beta _{\tau +1:H-1}\), and, by the induction hypothesis, there is a \(\psi _{\tau +1:H-1} \in \Delta ({{\mathcal {B}}}_{\tau +1:H-1})\) equal to \(\beta _{\tau +1:H-1}\). We can thus define \(\psi _{\tau :H-1} = \beta _{\tau } \oplus \psi _{\tau +1:H-1} = \beta _\tau \oplus \beta _{\tau +1:H-1} = \beta _{\tau :H-1}\), which is in \(\Delta ({{\mathcal {B}}}_{\tau :H-1})\). From this follows a natural injection from the set of behavioral strategies into the set of distributions over behavioral strategies.

The surjection from \(\Delta ({{\mathcal {B}}}_{\tau :}^i)\) to \({{\mathcal {B}}}_{\tau :}^i\) is given by the realization weight computation algorithm detailed in Algorithm 2. \(\square \)

As discussed in Appendix D.3.3, no effort is required to extract a solution strategy for a player from the lower bound (for 1) or the upper bound (for 2), but that strategy is in an unusual recursive form. We will here see (in the finite horizon setting) how to derive a (unique) equivalent behavioral strategy \(\beta ^i_{0:}\) using realization weights [23] in intermediate steps. To that end, we first define these realization weights in the case of a behavioral strategy (rather than for a mixed strategy as done by Koller et al.) and present some useful properties.

About Realization Weights

Let us denote \(rw^i(a^i_0, z^i_1, a^i_1, \dots , a^i_\tau )\) the realization weight (RW) of sequence \(a^i_0, z^i_1, a^i_1,\) \(\dots , a^i_\tau \) under strategy \(\beta ^i_{0:}\), defined as

$$\begin{aligned} rw^i(a^i_0, z^i_1, a^i_1, \dots , a^i_\tau )&{\mathop {=}\limits ^{\tiny {\text {def}}}}\prod _{t=0}^\tau \beta ^i_{0:}(a^i_t | a^i_0, z^i_1, a^i_1, \dots , z^i_t) \end{aligned}$$
(D14)
$$\begin{aligned}&= rw^i(a^i_0, z^i_1, a^i_1, \dots , a^i_{\tau -1}) \cdot \beta ^i_{0:}(a^i_\tau | \underbrace{a^i_0, z^i_1, a^i_1, \dots , z^i_\tau }_{\theta ^i_\tau }). \end{aligned}$$
(D15)

This definition already leads to useful results such as:

$$\begin{aligned} \beta ^i_{0:}(a^i_\tau | \theta ^i_\tau )&= \frac{ rw^i(\theta ^i_{\tau -1}, a^i_{\tau -1}, z^i_\tau , a^i_\tau ) }{ rw^i(\theta ^i_{\tau -1}, a^i_{\tau -1}) }, \end{aligned}$$
(D16)

and

$$\begin{aligned} \forall z^i_\tau , \quad rw^i(\theta ^i_{\tau -1}, a^i_{\tau -1})&= rw^i(\theta ^i_{\tau -1}, a^i_{\tau -1}) \cdot \underbrace{\sum _{a^i_\tau } \beta ^i_{0:}(a^i_\tau | \theta ^i_{\tau -1}, a^i_{\tau -1}, z^i_\tau )}_{=1} \end{aligned}$$
(D17)
$$\begin{aligned}&= \sum _{a^i_\tau } rw^i(\theta ^i_{\tau -1}, a^i_{\tau -1}) \cdot \beta ^i_{0:}(a^i_\tau | \theta ^i_{\tau -1}, a^i_{\tau -1}, z^i_\tau ) \end{aligned}$$
(D18)
$$\begin{aligned}&= \sum _{a^i_\tau } rw^i(\theta ^i_{\tau -1}, a^i_{\tau -1}, z^i_\tau , a^i_\tau ). \end{aligned}$$
(D19)
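To make Eqs. D14–D19 concrete, here is a minimal Python sketch for a toy behavioral strategy with binary actions and observations. All names (`beta`, `rw`, `histories`) are illustrative assumptions, not the paper's code; it simply checks Eqs. D16 and D19 numerically.

```python
H = 2
ACTIONS = [0, 1]
OBS = [0, 1]

def beta(a, history):
    """Toy behavioral strategy beta^i_{0:}: uniform at t=0, then copy the last observation."""
    if not history:                 # t = 0: no history yet
        return 0.5
    z = history[-1]                 # last observation in the history
    return 1.0 if a == z else 0.0

def histories(t):
    """All action-observation sequences (a_0, z_1, a_1, ..., a_t) up to step t."""
    seqs = [(a,) for a in ACTIONS]
    for _ in range(t):
        seqs = [s + (z, a) for s in seqs for z in OBS for a in ACTIONS]
    return seqs

# Eq. D14: rw(a_0, z_1, ..., a_t) = prod_t beta(a_t | prefix)
def rw(seq):
    w = 1.0
    for i in range(0, len(seq), 2):   # even indices hold actions
        w *= beta(seq[i], seq[:i])
    return w

# Eq. D19: summing the weights over the last action recovers the parent's weight
for s in histories(0):
    for z in OBS:
        assert abs(rw(s) - sum(rw(s + (z, a)) for a in ACTIONS)) < 1e-12

# Eq. D16: the behavioral strategy is recovered as a ratio of realization weights
s, z = (0,), 1
for a in ACTIONS:
    assert abs(beta(a, s + (z,)) - rw(s + (z, a)) / rw(s)) < 1e-12
```

The assertions pass because each decision rule sums to one over actions (D17–D19) and the weights factor along the history (D14).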

We now extend Koller et al.’s definition by introducing conditional realization weights, where the realization weight of a suffix sequence is “conditioned” on a prefix sequence:

$$\begin{aligned}&rw^i(\underbrace{a^i_\tau , \dots , a^i_{\tau '}}_{\text {suffix seq.}} | \underbrace{a^i_0, \dots , z^i_\tau }_{\text {prefix seq.}}) {\mathop {=}\limits ^{\tiny {\text {def}}}}\prod _{t=\tau }^{\tau '} \beta ^i_{0:}(a^i_t | a^i_0, \dots , z^i_\tau , a^i_\tau , \dots , z^i_t) \end{aligned}$$
(D20)
$$\begin{aligned}&= \beta ^i_{0:}(a^i_{\tau } | a^i_0, \dots , z^i_\tau ) \cdot rw^i(a^i_{\tau +1}, \dots , a^i_{\tau '} | a^i_0, \dots , z^i_{\tau +1}). \end{aligned}$$
(D21)

As can be noted, this definition only requires the knowledge of a partial strategy \(\beta ^i_{\tau :}\) rather than a complete strategy \(\beta ^i_{0:}\).

Mixing Realization Weights

Let \(\tau '\ge \tau +1\), and let \(rw[w]\) denote the realization weights induced by some element \(w\) at \(\tau +1\). Then, for some distribution \(\psi ^i_\tau \) over such elements, we have

$$\begin{aligned}&rw[\psi ^i_\tau ](a^i_{\tau +1}, \dots , a^i_{\tau '} | a^i_0, \dots , z^i_{\tau +1}) \end{aligned}$$
(D22)
$$\begin{aligned}&= \sum _w \psi ^i_\tau (w) \cdot rw[w](a^i_{\tau +1}, \dots , a^i_{\tau '} | a^i_0, \dots , z^i_{\tau +1}). \end{aligned}$$
(D23)

From \(w^i_0\) to \(\beta ^i_{0:}\)

Algorithm 2: Extracting \(\beta ^i_{0:}\) from \(w^i_0\)

Using the above results, function Extract in Algorithm 2 derives a behavioral strategy \(\beta ^i_{0:}\) equivalent to the recursive strategy induced by some tuple \(w^i_0\) in 3 steps as follows:

  1. From \(w^i_0\) to \(rw(\theta ^i_{0:H-1}, a^i_{H-1})\) (\(\forall (\theta ^i_{0:H-1}, a^i_{H-1})\)) — These (classical) realization weights are obtained by recursively going through the directed acyclic graph describing the recursive strategy, computing full-length (conditional) realization weights \(rw(\theta ^i_{t:H-1}, a^i_{H-1} | \theta ^i_{0:t})\) (for \(t=H-1\) down to 0). When in a leaf node, at depth \(H-1\), the initialization is given by Eq. D20 with \(\tau =\tau '=H-1\):

    $$\begin{aligned} rw^i(a^i_{H-1} | a^i_0, \dots , z^i_{H-1})&{\mathop {=}\limits ^{\tiny {\text {def}}}}\prod _{t={H-1}}^{{H-1}} \beta ^i(a^i_t | a^i_0, \dots , z^i_t) \\&= \beta ^i(a^i_{H-1} | a^i_0, \dots , z^i_{H-1}). \end{aligned}$$

    Then, in the backward phase, we can compute full length realization weights \(rw(\theta ^i_{t+1:H-1}, a^i_{H-1} | \theta ^i_{0:t})\) with increasingly longer suffixes (thus shorter prefixes) using (i) Eq. D23 (in function FRecGetRWMix, line 25) to “mix” several strategies using the distribution \(\psi ^i_t\) attached to the current w, and (ii) Eq. D21, with \(\tau '=H-1\), (in function FRecGetRWCat, line 16) to concatenate the behavioral decision rule \(\beta ^i_t\) attached to the current w in front of the strategy induced by the distribution \(\psi ^i_t\) also attached to w. Note: Memoization can here be used to avoid repeating the same computations.

  2. Retrieving (classical) realization weights \(rw(\theta ^i_{0:t}, a^i_t | -)\) (\(\forall t\)) — We can now compute realization weights \(rw(\theta ^i_{0:t}, a^i_t | -)\) for all \(t\)'s using Eq. D19 (line 6).

  3. Retrieving behavioral decision rules \(\beta ^i_t\) — Applying Eq. D16 (line 9) then provides the expected behavioral decision rules.

In practice, lossless compressions are used to reduce the dimensionality of the occupancy state (cf. Sect. 4.1.2), but they are lost in the current implementation of the conversion. Ideally, one would like to preserve compressions whenever possible, or at least retrieve them afterward, and possibly identify further compressions in the solution strategy.

Appendix E HSVI for zs-POSGs

This section presents results that help (i) tune zs-oMG-HSVI's radius parameter \(\rho \) so as to ensure that trajectories always stop, and (ii) demonstrate the finite-time convergence of this algorithm.

E.1 Algorithm

E.1.1 Setting \(\rho \)

Proposition 16

Bounding \(\lambda _{\tau }\) by \(\lambda ^{\infty } = \frac{1}{2} \frac{1}{1-\gamma } \left[ r_{\max } - r_{\min } \right] \) when \(\gamma <1\), and noting that

$$\begin{aligned} thr(\tau )&= \gamma ^{-\tau }\epsilon - 2 \rho \lambda ^\infty \frac{\gamma ^{-\tau }-1}{1-\gamma } \quad \text {if } \gamma <1 \nonumber \\&\text {(} = \epsilon - \rho (r_{\max }-r_{\min }) (2H + 1 - \tau ) \tau \quad \text {if } \gamma =1 \text {)}, \end{aligned}$$
(E24)

one can ensure positivity of the threshold at any depth \(\tau \ge 1\) by enforcing

\(0< \rho < \frac{1-\gamma }{2\lambda ^\infty }\epsilon \) (or \(0< \rho <\frac{\epsilon }{(r_{\max }-r_{\min }) (H + 1)H}\) if \(\gamma =1\)).

Proof

\(\underline{\text {Let us first consider the case}\,\,\gamma <1.}\)

We have (for \(\tau \ge 1\)):

$$\begin{aligned} thr(\tau )&= \gamma ^{-\tau }\epsilon - \sum _{i=1}^\tau 2 \rho \lambda ^\infty \gamma ^{-i} \\&= \gamma ^{-\tau }\epsilon - 2 \rho \lambda ^\infty \sum _{i=1}^\tau \gamma ^{-i} \\&= \gamma ^{-\tau }\epsilon - 2 \rho \lambda ^\infty \left( \gamma ^{-1} + \gamma ^{-2} + \cdots + \gamma ^{-\tau } \right) \\&= \gamma ^{-\tau }\epsilon - 2 \rho \lambda ^\infty \gamma ^{-1} \left( \gamma ^{0} + \gamma ^{-1} + \cdots + \gamma ^{-(\tau -1)} \right) \\&= \gamma ^{-\tau }\epsilon - 2 \rho \lambda ^\infty \gamma ^{-1} \frac{\gamma ^{-\tau }-1}{\gamma ^{-1}-1} \\&= \gamma ^{-\tau }\epsilon - 2 \rho \lambda ^\infty \frac{\gamma ^{-\tau }-1}{1-\gamma }. \end{aligned}$$

Then, let us derive the following equivalent inequalities:

$$\begin{aligned} 0&< thr(\tau ) \\ 2 \rho \lambda ^\infty \frac{\gamma ^{-\tau }-1}{1-\gamma }&< \gamma ^{-\tau }\epsilon \\ \rho&< \frac{1}{2\lambda ^\infty } \frac{1-\gamma }{\gamma ^{-\tau }-1} \gamma ^{-\tau } \epsilon \\ \rho&< \frac{1}{2\lambda ^\infty } \frac{1-\gamma }{1-\gamma ^\tau } \epsilon . \end{aligned}$$

To ensure positivity of the threshold for any \(\tau \ge 1\), one thus just needs to set \(\rho \) as a positive value smaller than \(\frac{1-\gamma }{2\lambda ^\infty }\epsilon \).

\(\underline{\text {Let us now consider the case}\,\,\gamma =1.}\)

We have (for \(\tau \in \{1,\dots ,H-1\}\)):

$$\begin{aligned} thr(\tau )&{\mathop {=}\limits ^{\tiny {\text {def}}}}\epsilon - \sum _{i=1}^\tau 2 \rho \lambda _{\tau -i} \\&= \epsilon - \sum _{i=1}^\tau 2 \rho (H-(\tau -i))\cdot (r_{\max }-r_{\min }) \\&= \epsilon - 2 \rho (r_{\max }-r_{\min }) \left[ \tau (H-\tau ) + \sum _{i=1}^\tau i \right] \\&= \epsilon - 2 \rho (r_{\max }-r_{\min }) \left[ \tau H - \tau ^2 + \frac{1}{2}\tau (\tau +1) \right] \\&= \epsilon - 2 \rho (r_{\max }-r_{\min }) \left[ (H+\frac{1}{2}) \tau - \frac{1}{2} \tau ^2 \right] \\&= \epsilon - \rho (r_{\max }-r_{\min }) \left[ (2H+1) \tau - \tau ^2 \right] \\&= \epsilon - \rho (r_{\max }-r_{\min }) \left[ (2H + 1 - \tau ) \tau \right] . \end{aligned}$$

Then, let us derive the following equivalent inequalities:

$$\begin{aligned} 0&< thr(\tau ) \\ \rho (r_{\max }-r_{\min }) (2H + 1 - \tau ) \tau&< \epsilon \\ \rho&< \frac{\epsilon }{(r_{\max }-r_{\min }) (2H + 1 - \tau ) \tau }. \end{aligned}$$

The function \(f: \tau \mapsto \frac{\epsilon }{(r_{\max }-r_{\min }) (2H + 1 - \tau ) \tau }\) reaches its minimum (for \(\tau \in (0,2H+1)\)) when \(\tau =H+\frac{1}{2}\); over integer values of \(\tau \), the denominator \((2H + 1 - \tau ) \tau \) is at most \((H + 1)H\) (attained at \(\tau =H\) or \(\tau =H+1\)). To ensure positivity of the threshold for any \(\tau \in \{1,\dots ,H-1\}\), one thus just needs to set \(\rho \) as a positive value smaller than \(\frac{\epsilon }{(r_{\max }-r_{\min }) (H + 1)H}\). \(\square \)
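Both bounds on \(\rho \) can be checked numerically. The following Python sketch (function and variable names are ours, purely illustrative) implements \(thr(\tau )\) from Eq. E24 and verifies that choosing \(\rho \) just below each bound keeps the threshold positive at every relevant depth.

```python
def thr(tau, eps, rho, gamma, r_max, r_min, H=None):
    """Threshold thr(tau) of Eq. E24, for gamma < 1 and gamma = 1."""
    if gamma < 1:
        lam_inf = 0.5 * (r_max - r_min) / (1 - gamma)
        return gamma**(-tau) * eps - 2 * rho * lam_inf * (gamma**(-tau) - 1) / (1 - gamma)
    # gamma = 1 case of Eq. E24
    return eps - rho * (r_max - r_min) * (2 * H + 1 - tau) * tau

eps, r_max, r_min = 0.1, 1.0, -1.0

# Case gamma < 1: take rho just below (1 - gamma) * eps / (2 * lam_inf)
gamma = 0.9
lam_inf = 0.5 * (r_max - r_min) / (1 - gamma)
rho = 0.99 * (1 - gamma) * eps / (2 * lam_inf)
assert all(thr(t, eps, rho, gamma, r_max, r_min) > 0 for t in range(1, 200))

# Case gamma = 1: take rho just below eps / ((r_max - r_min) * (H + 1) * H)
H = 10
rho = 0.99 * eps / ((r_max - r_min) * (H + 1) * H)
assert all(thr(t, eps, rho, 1.0, r_max, r_min, H) > 0 for t in range(1, H))
```

With \(\rho \) at 99% of either bound, the threshold indeed stays strictly positive for all tested depths.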

E.2 Finite-Time Convergence

E.2.1 Convergence Proof

Proving the finite-time convergence of zs-oMG-HSVI to an error-bounded solution requires some preliminary lemmas.

Lemma 10

Let \((\sigma _0,\dots ,\sigma _{\tau +1})\) be a full trajectory generated by zs-oMG-HSVI and \({\varvec{\beta }}_\tau \) the behavioral dr profile that induced the last transition, i.e., \(\sigma _{\tau +1}= T(\sigma _\tau , {\varvec{\beta }}_\tau )\).

Then, after updating \({\overline{W}}_{\tau }\) and \({\underline{W}}_{\tau }\), we have that \({\overline{W}}_{\tau }(\sigma _\tau ,\beta ^1_\tau ) - {\underline{W}}_{\tau }(\sigma _\tau ,\beta ^2_\tau ) \le \gamma thr(\tau +1)\).

Proof

By definition,

Therefore, after the update (\(\beta ^2_\tau \) and \(\beta ^1_\tau \) being added to their respective bags (\(\overline{\mathcal {I}}^1_\tau \) and \(\underline{\mathcal {I}}^2_\tau \)) along with vectors \(\overline{\nu }^2_{\tau +1}\) and \(\underline{\nu }^1_{\tau +1}\)),

$$\begin{aligned} {\overline{W}}_{\tau } (\sigma _\tau , \beta ^1_\tau )&\le \beta ^1_\tau \cdot \left[ r(\sigma _\tau , \cdot , \beta ^2_\tau ) + \gamma T^{1}_m(\sigma _\tau ,\cdot , \beta ^2_\tau ) \cdot \overline{\nu }^2_{\tau +1} \right] , \text { and} \\ {\underline{W}}_{\tau } (\sigma _\tau , \beta ^2_\tau )&\ge \beta ^2_\tau \cdot \left[ r(\sigma _\tau , \beta ^1_\tau , \cdot ) + \gamma T^{2}_m(\sigma _\tau , \beta ^1_\tau , \cdot ) \cdot \underline{\nu }^1_{\tau +1} \right] . \end{aligned}$$

Then,

\(\square \)

Lemma 11

(Monotonic evolution of \({\overline{W}}_{\tau }\) and \({\underline{W}}_{\tau }\)) Let \(K{\overline{W}}_{\tau }\) and \(K{\underline{W}}_{\tau }\) be the approximations after an update at \(\sigma _\tau \) with behavioral dr \(\langle \overline{\beta }^1_\tau , \underline{\beta }^2_\tau \rangle \) (respectively, associated to vectors \(\overline{\nu }^2_{\tau +1}\) and \(\underline{\nu }^1_{\tau +1}\)).

Let also \(K^{(n+1)}{\overline{W}}_{\tau }\) and \(K^{(n+1)}{\underline{W}}_{\tau }\) be the same approximations after n other updates (in various os s). Then,

$$\begin{aligned} \max _{\beta ^1_\tau } K^{(n+1)} {\overline{W}}_{\tau }(\sigma _\tau ,\beta ^1_\tau )&\le \max _{\beta ^1_\tau } K{\overline{W}}_{\tau }(\sigma _\tau ,\beta ^1_\tau ) \le {\overline{W}}_{\tau }(\sigma _\tau , \overline{\beta }^1_\tau ) \quad \text { and} \\ \min _{\beta ^2_\tau } K^{(n+1)} {\underline{W}}_{\tau }(\sigma _\tau ,\beta ^2_\tau )&\ge \min _{\beta ^2_\tau } K{\underline{W}}_{\tau }(\sigma _\tau , \beta ^2_\tau ) \ge {\underline{W}}_{\tau }(\sigma _\tau , \underline{\beta }^2_\tau ). \end{aligned}$$

Proof

Starting from the definition,

Then, this upper bound approximation can only be refined, so that, for any \(n \in {{\textbf{N}}}\),

$$\begin{aligned} \forall \beta ^1_\tau , \quad K^{(n+1)}{\overline{W}}_{\tau }(\sigma _\tau ,\beta ^1_\tau )&\le K{\overline{W}}_{\tau }(\sigma _\tau ,\beta ^1_\tau ), \\ \text {thus, } \max _{\beta ^1_\tau } K^{(n+1)}{\overline{W}}_{\tau }(\sigma _\tau ,\beta ^1_\tau )&\le \max _{\beta ^1_\tau } K{\overline{W}}_{\tau }(\sigma _\tau ,\beta ^1_\tau ). \end{aligned}$$

The expected result thus holds for \({\overline{W}}_{\tau }\), and symmetrically for \({\underline{W}}_{\tau }\). \(\square \)

Lemma 12

After updating, in order, \({\overline{W}}_{\tau }\) and \(\overline{V}_\tau \), we have

$$\begin{aligned} K\overline{V}_\tau (\sigma _\tau ) \le \max _{\beta ^1_\tau } K{\overline{W}}_{\tau }(\sigma _\tau ,\beta ^1_\tau ). \end{aligned}$$

After updating, in order, \({\underline{W}}_{\tau }\) and \(\underline{V}_\tau \), we have

$$\begin{aligned} K\underline{V}_\tau (\sigma _\tau ) \ge \min _{\beta ^2_\tau } K{\underline{W}}_{\tau }(\sigma _\tau ,\beta ^2_\tau ). \end{aligned}$$

Proof

After updating \(\overline{\mathcal {I}}^1_\tau \), the algorithm computes (Algorithm 1, line 20) a new solution \(\overline{\psi }^2_\tau \) of the dual LP (at \(\sigma ^1_\tau \)) and the associated vector \(\overline{\nu }^2_\tau \), so that

$$\begin{aligned} \max _{\beta ^1_\tau } K {\overline{W}}_{\tau }(\sigma ^1_\tau , \beta ^1_\tau )&= \sigma ^{m,1}_\tau \cdot \overline{\nu }^2_\tau . \end{aligned}$$

This vector will feed \(\overline{bagV}_\tau \) along with \(\sigma ^1_\tau \), so that

$$\begin{aligned} K \overline{V}_\tau (\sigma _\tau )&\le \sigma ^{m,1}_\tau \cdot \overline{\nu }^2_\tau . \end{aligned}$$

As a consequence,

$$\begin{aligned} K \overline{V}_\tau (\sigma _\tau )&\le \max _{\beta ^1_\tau } K {\overline{W}}_{\tau }(\sigma _\tau , \beta ^1_\tau ). \end{aligned}$$

The symmetric property holds for \(K\underline{V}_\tau \) and \(K{\underline{W}}_{\tau }\), which concludes the proof. \(\square \)

Theorem 10

(originally stated on page 15) zs-oMG-HSVI (Algorithm 1) terminates in finite time with an \(\epsilon \)-approximation of \(V^*_0(\sigma _0)\) that satisfies Theorem 9.

Proof

We will prove by induction from \(\tau =H\) to 0, that the algorithm stops expanding os s at depth \(\tau \) after finitely many iterations (/trajectories).

First, by definition of horizon H, no os \(\sigma _H\) is ever expanded. The property thus holds at \(\tau =H\).

Let us now assume that the property holds at depth \(\tau +1\) after \(N_{\tau +1}\) iterations.

By contradiction, let us assume that the algorithm generates infinitely many trajectories of length \(\tau +1\). Then, because \({{\mathcal {O}}}^\sigma _\tau \times {{\mathcal {B}}}_\tau \) is compact, after some time the algorithm will have visited \(\langle \sigma _\tau , {\varvec{\beta }}_\tau \rangle \) and then, some iterations later, \(\langle \sigma _\tau ^{'}, {\varvec{\beta }}_\tau ^{'} \rangle \), such that \(\Vert \sigma _\tau - \sigma _\tau ^{'} \Vert _1 \le \rho \). Let us also denote the corresponding terminal os s (trajectories beyond iteration \(N_{\tau +1}\) do not go further) by \(\sigma _{\tau +1} = T(\sigma _\tau , {\varvec{\beta }}_\tau )\) and \(\sigma _{\tau +1}^{'} = T(\sigma _\tau ^{'}, {\varvec{\beta }}_\tau ^{'})\). Now, we show that the second trajectory should not have happened, i.e., \(\overline{V}(\sigma _\tau ^{'}) - \underline{V}(\sigma _\tau ^{'}) \le thr(\tau )\).

Combining the previous lemmas,

$$\begin{aligned} \overline{V}(\sigma _\tau ^{'})&\le \overline{V}(\sigma _\tau ) + \lambda _{\tau } \Vert \sigma _\tau - \sigma _\tau ^{'}\Vert _1 \qquad \qquad \text {(By Lipschitz-Continuity)}\\&\le \max _{\tilde{\beta }^1_\tau } {\overline{W}}_{\tau }(\sigma _\tau ,\tilde{\beta }^1_\tau ) + \lambda _{\tau } \Vert \sigma _\tau - \sigma _\tau ^{'}\Vert _1 \qquad \qquad (\text {Lemma}\, 12) \\&\le {\overline{W}}_{\tau }(\sigma _\tau ,\beta ^1_\tau ) + \lambda _{\tau } \Vert \sigma _\tau - \sigma _\tau ^{'}\Vert _1 \qquad \qquad (\text {Lemma}\, 11) \\&= {\overline{W}}_{\tau }(\sigma _\tau ,\beta ^1_\tau ) + \lambda _{\tau } \rho . \end{aligned}$$

Symmetrically, we also have

$$\begin{aligned} \underline{V}(\sigma _\tau ^{'})&\ge {\underline{W}}_{\tau }(\sigma _\tau ,\beta ^2_\tau ) - \lambda _{\tau } \rho . \end{aligned}$$

Hence,

$$\begin{aligned} \overline{V}(\sigma _\tau ^{'}) - \underline{V}(\sigma _\tau ^{'})&\le {\overline{W}}_{\tau }(\sigma _\tau ,\beta ^1_\tau ) - {\underline{W}}_{\tau }(\sigma _\tau ,\beta ^2_\tau ) + 2 \lambda _{\tau } \rho \\&\le \gamma \, thr(\tau +1) + 2 \rho \lambda _{\tau } \qquad \qquad (\text {Lemma}\, 10) \\&\le thr(\tau ). \end{aligned}$$

Therefore, \(\sigma ^{'}_\tau \) should not have been expanded.

This shows that the algorithm will generate only a finite number of trajectories of length \(\tau \). \(\square \)

E.2.2 Handling Infinite Horizons

Proposition 17

(originally stated on page 16)

When \(\gamma <1\), the length of trajectories is upper-bounded by \(\log _{\gamma }\left( \frac{ \epsilon - \frac{2 \rho \lambda ^\infty }{1-\gamma } }{ W- \frac{2 \rho \lambda ^\infty }{1-\gamma } }\right) \), where \(\lambda ^\infty \) is a depth-independent Lipschitz constant and \(W{\mathop {=}\limits ^{\tiny {\text {def}}}}\Vert \overline{V}^{(0)}-\underline{V}^{(0)}\Vert _\infty \) is the maximum width between initializations.

Proof

(detailed version) Since W is the largest possible width, any trajectory stops in the worst case at depth \(\tau \) such that

$$\begin{aligned} thr(\tau )&< W\\ \gamma ^{-\tau }\epsilon - 2 \rho \lambda ^\infty \frac{\gamma ^{-\tau }-1}{1-\gamma }&< W&(\text {from Eq.}\, (E24)) \\ \gamma ^{-\tau } \epsilon - 2 \rho \lambda ^\infty \frac{\gamma ^{-\tau }}{1-\gamma } - 2 \rho \lambda ^\infty \frac{-1}{1-\gamma }&< W\\ \gamma ^{-\tau } \underbrace{\left( \epsilon - \frac{2 \rho \lambda ^\infty }{1-\gamma } \right) }_{>0 \quad (\text {Proposition 16})}&< W- \frac{2 \rho \lambda ^\infty }{1-\gamma } \\ \gamma ^{-\tau }&< \frac{ W- \frac{2 \rho \lambda ^\infty }{1-\gamma } }{ \epsilon - \frac{2 \rho \lambda ^\infty }{1-\gamma } }\\ \exp (-\tau \ln (\gamma ))&< \exp \left( \ln \left( \frac{ W- \frac{2 \rho \lambda ^\infty }{1-\gamma } }{ \epsilon - \frac{2 \rho \lambda ^\infty }{1-\gamma } }\right) \right) \\ -\tau \ln (\gamma )&< \ln \left( \frac{ W- \frac{2 \rho \lambda ^\infty }{1-\gamma } }{ \epsilon - \frac{2 \rho \lambda ^\infty }{1-\gamma } }\right) \\ \tau \ln (\gamma )&> \ln \left( \frac{ \epsilon - \frac{2 \rho \lambda ^\infty }{1-\gamma } }{ W- \frac{2 \rho \lambda ^\infty }{1-\gamma } }\right) \\ \tau&< \log _{\gamma }\left( \frac{ \epsilon - \frac{2 \rho \lambda ^\infty }{1-\gamma } }{ W- \frac{2 \rho \lambda ^\infty }{1-\gamma } }\right) . \end{aligned}$$

\(\square \)

Even if the problem horizon is infinite, trajectories will thus have bounded length. Then, everything beyond this effective horizon will rely on the upper- and lower-bound initializations and the corresponding strategies.
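This effective-horizon bound can also be checked numerically. In the Python sketch below (variable names are ours, purely illustrative), the threshold \(thr(\tau )\) crosses the maximum width \(W\) exactly at the \(\log _\gamma \) bound of Proposition 17, so no trajectory can be expanded beyond it.

```python
import math

# Illustrative problem constants (not from the paper's benchmarks)
eps, gamma, r_max, r_min, W = 0.1, 0.9, 1.0, -1.0, 20.0
lam_inf = 0.5 * (r_max - r_min) / (1 - gamma)
rho = 0.5 * (1 - gamma) * eps / (2 * lam_inf)    # satisfies Proposition 16

# Bound of Proposition 17: tau_max = log_gamma((eps - c) / (W - c))
c = 2 * rho * lam_inf / (1 - gamma)
tau_max = math.log((eps - c) / (W - c), gamma)

def thr(tau):
    """Threshold thr(tau) of Eq. E24 (gamma < 1 case)."""
    return gamma**(-tau) * eps - 2 * rho * lam_inf * (gamma**(-tau) - 1) / (1 - gamma)

# thr is increasing in tau and crosses W exactly at tau_max:
# just beyond tau_max the threshold exceeds W, so the trajectory has stopped
assert thr(math.ceil(tau_max)) >= W
assert thr(math.floor(tau_max)) <= W
```

With these illustrative constants, `tau_max` is finite (between 56 and 57), even though the problem horizon itself is infinite.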


Cite this article

Delage, A., Buffet, O., Dibangoye, J.S. et al. HSVI Can Solve Zero-Sum Partially Observable Stochastic Games. Dyn Games Appl 14, 751–805 (2024). https://doi.org/10.1007/s13235-023-00519-6
