We now present some of the main ideas of distributional reinforcement learning in a tabular setting. First, we look at the evaluation problem, where we are trying to find the state-action value of a fixed policy $\pi$. Second, we consider the control problem, where we try to find the optimal state-action value. Third, we consider the distributional approximation procedure CDRL used by the agents in this paper.
2.2.1. Evaluation
We consider a distributional variant of (2), the distributional Bellman operator $\mathcal{T}^{\pi}$ given by
$$(\mathcal{T}^{\pi} \eta)(x,a) = \mathbb{E}\big[ (b_{R,\gamma})_{\#}\, \eta(X', A') \,\big|\, X = x,\, A = a \big], \qquad b_{r,\gamma}(z) = r + \gamma z,$$
where $X' \sim p(\cdot \mid x, a)$, $A' \sim \pi(\cdot \mid X')$, and $(b_{r,\gamma})_{\#}$ denotes the pushforward through $b_{r,\gamma}$. Here, $\mathcal{T}^{\pi}$ is, for all $p \in [1, \infty]$, a $\gamma$-contraction with a unique fixed point when the space of return-distribution collections is endowed with the supremum $p$-Wasserstein metric ([5], Lemma 3); see [15] for more details on Wasserstein distances. Moreover, by Proposition 2 of [9], $\mathcal{T}^{\pi}$ is expectation preserving when initially coupled with the $Q$-iteration given in (2); that is, given an initial collection $\eta_0$ and a function $g$ such that $\mathbb{E}_{Z \sim \eta_0(x,a)}[Z] = g(x,a)$ for all $(x,a)$, then
$$\mathbb{E}_{Z \sim ((\mathcal{T}^{\pi})^{k} \eta_0)(x,a)}[Z] = \big((T^{\pi})^{k} g\big)(x,a)$$
holds for all $k \in \mathbb{N}$ and all $(x,a)$.
Thus, if we let $\eta^{\pi}$ be the collection of distributions of $Z^{\pi}$ in (1), then $\eta^{\pi}$ is the unique fixed point satisfying the distributional Bellman equation:
$$\eta^{\pi} = \mathcal{T}^{\pi} \eta^{\pi}.$$
It follows that iterating $\mathcal{T}^{\pi}$ on any starting collection with bounded moments eventually solves the evaluation task for $\pi$ to an arbitrary degree of accuracy.
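On a finite MDP, one application of the distributional Bellman operator can be computed exactly by representing each return distribution as a finite set of weighted atoms. The following is a minimal sketch; all states, actions, rewards, and the discount factor are illustrative, not taken from the paper:

```python
# Sketch of one exact application of the distributional Bellman operator
# T^pi on a finite MDP; all MDP quantities below are illustrative.
GAMMA = 0.9

def bellman_backup(eta, reward, trans, policy, gamma=GAMMA):
    """Apply T^pi once.

    eta[s][a]    : dict {return atom: probability} for the current estimate
    reward[s][a] : deterministic immediate reward
    trans[s][a]  : dict {next state: transition probability}
    policy[s]    : dict {action: probability}
    """
    new_eta = {}
    for s in eta:
        new_eta[s] = {}
        for a in eta[s]:
            out = {}
            r = reward[s][a]
            for s2, p in trans[s][a].items():
                for a2, pa in policy[s2].items():
                    for z, q in eta[s2][a2].items():
                        # pushforward of the atom through b(z) = r + gamma * z
                        atom = r + gamma * z
                        out[atom] = out.get(atom, 0.0) + p * pa * q
            new_eta[s][a] = out
    return new_eta
```

Since the operator is expectation preserving, the mean of each backed-up distribution coincides with one step of the ordinary $Q$-iteration applied to the corresponding expected values.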
2.2.3. Categorical Evaluation and Control
In most real applications, the updates of (4) and (5) are either computationally infeasible or impossible to compute in full because the transition kernel $p$ is unknown. It follows that approximations are key to defining practical distributional algorithms. This could involve a parametrization over some selected set of distributions, along with projections onto these distributional subspaces. It could also involve stochastic approximation with sampled transitions and gradient updates under function approximation.
A structure for algorithms making use of such approximations is Categorical Distributional Reinforcement Learning (CDRL). What follows is a short summary of the CDRL procedure, which is fundamental to the single-agent implementations in this paper.
Let $z_1 < z_2 < \cdots < z_K$ be a fixed, ordered set of equally spaced real numbers, so that $z_{i+1} - z_i = \Delta z$ for each $i$, with $\Delta z = (z_K - z_1)/(K - 1)$. Let:
$$\mathscr{C} = \Big\{ \sum_{i=1}^{K} q_i \delta_{z_i} : q_i \geq 0, \ \sum_{i=1}^{K} q_i = 1 \Big\}$$
be the subset of categorical distributions in $\mathscr{P}(\mathbb{R})$ supported on $\{z_1, \dots, z_K\}$. We consider parameterized distributions by using $\mathscr{C}$ as the collection of possible inputs and outputs of an algorithm. Moreover, for each collection $\eta$ with $\eta(x,a) = \sum_{i=1}^{K} q_i(x,a)\, \delta_{z_i} \in \mathscr{C}$, we have:
$$Q_{\eta}(x,a) = \sum_{i=1}^{K} z_i\, q_i(x,a)$$
as its Q-value function.
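For instance, the Q-value of a categorical distribution is simply the support-weighted sum of its probabilities. A small sketch, where the number of atoms and the support bounds are assumptions for illustration:

```python
import numpy as np

# Illustrative support of K equally spaced atoms (bounds are assumptions).
K, v_min, v_max = 51, -10.0, 10.0
z = np.linspace(v_min, v_max, K)  # z_1 < ... < z_K, spacing (v_max - v_min)/(K - 1)

def q_value(probs, z=z):
    """Q-value of the categorical distribution sum_i probs[i] * delta_{z[i]}."""
    return float(np.dot(z, probs))
```

A point mass on the middle atom of this symmetric support, for example, has Q-value zero.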
Ahead of the subsequent treatment of our extension of CDRL, we first reproduce the steps of the general procedure in Algorithm 1 (see [10], Algorithm 1).
Algorithm 1: Categorical Distributional Reinforcement Learning (CDRL)
1. At each iteration step $t$, with current estimate $\eta_t$ and input $(x,a)$, sample a transition $(x, a, r, x')$.
2. Select $a^{*}$ to be either sampled from $\pi(\cdot \mid x')$ in the evaluation setting, or taken as $a^{*} \in \arg\max_{a'} Q_{\eta_t}(x', a')$ in the control setting.
3. Recall the Cramér projection $\Pi_{C}$ given in Definition 2, and put:
$$\hat{\eta}_t(x,a) = \Pi_{C}\big( (b_{r,\gamma})_{\#}\, \eta_t(x', a^{*}) \big).$$
4. Take the next iterated function $\eta_{t+1}$ as some update such that:
$$\mathrm{KL}\big( \hat{\eta}_t(x,a) \,\big\|\, \eta_{t+1}(x,a) \big) \leq \mathrm{KL}\big( \hat{\eta}_t(x,a) \,\big\|\, \eta_t(x,a) \big),$$
where $\mathrm{KL}$ denotes the Kullback–Leibler divergence.
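On an equally spaced support, the Cramér projection admits a simple closed form: each atom's mass is split between its two neighbouring support points in proportion to proximity, and mass outside the support range is clipped to the nearest endpoint. A minimal sketch (scalar loop for clarity; support values illustrative):

```python
import numpy as np

def cramer_projection(atoms, probs, z):
    """Project sum_j probs[j] * delta_{atoms[j]} onto the equally spaced support z.

    Mass at an atom between z[i] and z[i+1] is split between the two
    neighbours in proportion to proximity; mass outside [z[0], z[-1]]
    is clipped to the nearest endpoint.
    """
    dz = z[1] - z[0]
    out = np.zeros_like(z)
    for x, p in zip(atoms, probs):
        x = min(max(x, z[0]), z[-1])  # clip to the support range
        b = (x - z[0]) / dz           # fractional index into the support
        lo, hi = int(np.floor(b)), int(np.ceil(b))
        if lo == hi:
            out[lo] += p              # atom coincides with a support point
        else:
            out[lo] += p * (hi - b)
            out[hi] += p * (b - lo)
    return out
```

Note that the projection preserves total mass, and it preserves the mean for atoms lying inside the support range.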
Consider first a finite MDP and a tabular setting. Define $\eta_{t+1}(\bar{x}, \bar{a}) = \eta_t(\bar{x}, \bar{a})$ whenever $(\bar{x}, \bar{a}) \neq (x, a)$. Then, by the convexity of the Kullback–Leibler divergence in its second argument, it is readily verified that updates of the form:
$$\eta_{t+1}(x,a) = (1 - \alpha_t)\, \eta_t(x,a) + \alpha_t\, \hat{\eta}_t(x,a)$$
satisfy Step 4. In fact, if there exists a unique policy $\pi^{*}$ associated with the convergence of (3), then this update yields almost sure convergence, with respect to the supremum-Cramér metric, to a distribution collection in the categorical class with $\pi^{*}$ as the greedy policy (under some additional assumptions on the stepsizes $\alpha_t$ and sufficient support; see [10], Theorem 2, for details).
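The tabular mixture update can be sketched directly on probability vectors; the divergence to the (fixed) target cannot increase under the mixture step. A small illustration, with the distributions and stepsize chosen arbitrarily:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence between categorical distributions p and q on a shared support."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def mixture_update(eta, target, alpha):
    """Tabular CDRL step: eta_{t+1} = (1 - alpha) * eta_t + alpha * target."""
    return (1.0 - alpha) * np.asarray(eta, float) + alpha * np.asarray(target, float)
```

For example, mixing toward a target with stepsize $\alpha = 0.3$ strictly reduces the KL divergence from that target while keeping the iterate a valid probability vector.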
In practice, we are often forced to use function approximation of the form $\eta_{\theta}(x,a)$, where $\eta_{\theta}$ is parameterized by some set of weights $\theta$. Gradient updates with respect to $\theta$ can then be made to minimize the loss:
$$\mathrm{KL}\big( \hat{\eta}(x,a) \,\big\|\, \eta_{\theta}(x,a) \big), \quad (6)$$
where $\hat{\eta}(x,a)$ is the computed learning target for the sampled transition. However, convergence with the Kullback–Leibler loss and function approximation is still an open question. Theoretical progress has been made for other losses, although we may then lose the stability benefits coming from the relative ease of minimizing (6) [9,11,16].
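One reason minimizing this loss is comparatively easy: with a softmax parameterization of the categorical probabilities, the gradient of the KL loss with respect to the logits reduces to the familiar cross-entropy gradient, probabilities minus target. A minimal sketch (target and learning rate illustrative, not the paper's settings):

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a logit vector."""
    e = np.exp(logits - logits.max())
    return e / e.sum()

def kl_grad_step(logits, target, lr=0.5):
    """One gradient-descent step on KL(target || softmax(logits)).

    For a softmax output, the gradient with respect to the logits is
    softmax(logits) - target, i.e. the standard cross-entropy gradient.
    """
    return logits - lr * (softmax(logits) - target)
```

Repeated steps drive the parameterized distribution toward the fixed target.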
An algorithm implementing CDRL with function approximation is C51 [5]. It uses essentially the same neural network architecture and training procedure as DQN [17]. To increase stability during training, this also involves sampling transitions from an experience replay buffer and maintaining an older, periodically updated copy of the weights for target computation. However, instead of estimating Q-values, C51 uses a finite support $\{z_1, \dots, z_{51}\}$ of 51 points and learns discrete probability distributions over it via a soft-max transfer. Training is done using the KL divergence as the loss function over batches, with targets $\hat{\eta}$ computed as in CDRL.
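The target computation combines the pushforward $b(z) = r + \gamma z$ with the categorical projection in a single vectorized pass. The following is a hedged sketch of that computation under an equally spaced support, not a reproduction of any particular implementation:

```python
import numpy as np

def categorical_target(next_probs, r, gamma, z):
    """Sketch of a CDRL/C51-style target on an equally spaced support z.

    Pushes each support atom through b(z) = r + gamma * z, clips to the
    support range, and redistributes the next-state probabilities onto
    the two neighbouring support points.
    """
    dz = z[1] - z[0]
    tz = np.clip(r + gamma * z, z[0], z[-1])   # transformed, clipped atoms
    b = (tz - z[0]) / dz                       # fractional indices
    lo, hi = np.floor(b).astype(int), np.ceil(b).astype(int)
    out = np.zeros_like(z)
    # (lo == hi) keeps mass that lands exactly on a support point
    np.add.at(out, lo, next_probs * (hi - b + (lo == hi)))
    np.add.at(out, hi, next_probs * (b - lo))
    return out
```

`np.add.at` is used for the scatter-add because repeated indices must accumulate rather than overwrite.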