
Dynamic programming conditions for partially observable stochastic systems. (English) Zbl 0258.93029

The problem of minimizing
\[ J(u) = E \int_0^T c(t,x_t,u_t)\,dt, \tag{1} \]
over the class of controls \(u_t = u(t,x_t)\), \(u_t \in \Xi\subset R^k\), subject to \[ dx_t = f(t,x_t,u_t)\,dt + \sigma(t,x_t)\,dB_t \tag{2} \]
(Itô equation; \(B = \) Wiener process) centres around the solution of the Bellman-Hamilton-Jacobi equation \(\Lambda W + \min_{u\in\Xi} (\nabla W\cdot f + c) = 0\), where
\[ W=W(t,x),\quad \nabla W = \frac{\partial W}{\partial x}\quad\text{and}\quad \Lambda W = \frac{\partial W}{\partial t} + \frac12 \sum_{ij} (\sigma\sigma')_{ij} \frac{\partial^2 W}{\partial x_i\partial x_j}. \]
If this equation has a smooth solution \(W\), then (i) \(W\) is the “value function” (the minimum cost over \([t,T]\) starting at \(x_t = x\)), (ii) \(W\) can be expanded (by the Itô differential rule) as
\[ W_t = J(u^0) + \int_0^t \Lambda W \,ds + \int_0^t \nabla W \,dx_s \tag{3} \]
and (iii) the optimal policy is \(u^0(t,x) = \lambda(t,x,\nabla W)\), where \(\lambda(t,x,p)\) is the value of \(u\in\Xi\) which minimizes \(p\cdot f(t,x,u)+c(t,x,u)\). See W. H. Fleming [SIAM Rev. 11, 470–509 (1969; Zbl 0192.52501)].
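A standard illustration of (i)–(iii), not taken from the paper under review, is the scalar linear regulator: take \(f = ax + bu\), \(\sigma\) constant, \(c = qx^2 + ru^2\) with \(q\ge 0\), \(r>0\), and \(\Xi = R\) (all symbols here are introduced for this illustration only). The trial solution \(W(t,x) = P(t)x^2 + g(t)\) gives \(\lambda(t,x,p) = -bp/(2r)\), and the Bellman-Hamilton-Jacobi equation reduces to
\[ \dot P + 2aP - \frac{b^2}{r}\,P^2 + q = 0, \qquad \dot g + \sigma^2 P = 0, \qquad P(T) = g(T) = 0, \]
so that the optimal policy in (iii) is the linear feedback \(u^0(t,x) = -(b/r)P(t)x\).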
The objective of this paper is to obtain analogous characterizations of the optimal policy for problems of the type (1), (2), but possibly non-Markovian and with partial observations. The solution of (2) is defined by the Girsanov measure transformation technique. Value functions (processes) \(W_u(t)\) are then defined directly and shown to satisfy a version of Bellman’s optimality principle. A control \(u\) is called “value decreasing” if the process \(W_u(t)\) is a supermartingale. For such controls the Meyer decomposition of supermartingales can be applied to write \(W_u(t) = -A_t + M_t\), where \(A_t\) is an increasing process and \(M_t\) a martingale. It is shown that \(A_t\) is absolutely continuous with respect to Lebesgue measure and that \(M_t\) has a stochastic integral representation. Thus a representation of \(W_u(t)\) of the form (3) is obtained (\(\Lambda W_u\) and \(\nabla W_u\) are now abstract processes), and conditions for optimality can be derived from this representation together with Bellman’s principle. In the case of complete observations there is a single value process \(W\) (the same for all controls), which has the above representation.
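For orientation, the representation just described may be sketched as follows (the density \(a_s\) and this precise form are notational conveniences of the sketch, not necessarily those of the paper): since \(A_t\) is absolutely continuous and \(M_t\) is a stochastic integral, one may write
\[ W_u(t) = W_u(0) - \int_0^t a_s\,ds + M_t, \qquad a_s \ge 0, \]
where the drift density \(-a_s\) and the integrand of \(M_t\) take over the roles played in (3) by \(\Lambda W + \nabla W\cdot f\) and \(\nabla W\cdot\sigma\), respectively.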
In a subsequent paper [the first author, ibid. 11, 587–594 (1973; Zbl 0238.93044)] it is shown that (iii) above holds; this leads to an existence theorem for the optimal policy, since the process \(W\) is defined whether or not an optimal policy exists.
Reviewer: M. H. A. Davis

MSC:

93E20 Optimal stochastic control
60G42 Martingales with discrete parameter