Document Zbl 1446.62112

The optimal unbiased value estimator and its relation to LSTD, TD and MC. (English) Zbl 1446.62112

Mach. Learn. 83, No. 3, 289-330 (2011).

Summary: In this analytical study we derive the optimal unbiased value estimator (MVU) and compare its statistical risk to three well known value estimators: Temporal Difference learning (TD), Monte Carlo estimation (MC) and Least-Squares Temporal Difference Learning (LSTD). We demonstrate that LSTD is equivalent to the MVU if the Markov Reward Process (MRP) is acyclic and show that both differ for most cyclic MRPs as LSTD is then typically biased. More generally, we show that estimators that fulfill the Bellman equation can only be unbiased for special cyclic MRPs. The reason for this is that at each state the bias is calculated with a different probability measure and due to the strong coupling by the Bellman equation it is typically not possible for a set of value estimators to be unbiased with respect to each of these measures. Furthermore, we derive relations of the MVU to MC and TD. The most important of these relations is the equivalence of MC to the MVU and to LSTD for undiscounted MRPs in which MC has the same amount of information. In the discounted case this equivalence does not hold anymore. For TD we show that it is essentially unbiased for acyclic MRPs and biased for cyclic MRPs. We also order estimators according to their risk and present counter-examples to show that no general ordering exists between the MVU and LSTD, between MC and LSTD and between TD and MC. Theoretical results are supported by examples and an empirical evaluation.

Cited in 1 Document

MSC:

62G07	Density estimation
65C05	Monte Carlo methods
68T05	Learning and adaptive systems in artificial intelligence

Keywords:

optimal unbiased value estimator; maximum likelihood value estimator; sufficient statistics; Lehmann-Scheffe theorem; Monte Carlo estimation (MC); temporal difference learning (TD); least-squares temporal difference learning (LSTD)

Cite Review PDF

Full Text: DOI arXiv

References:

[1]	Aigner, M. (2006). A course in enumeration. Berlin: Springer. · Zbl 1123.05001
[2]	Bauer, H., & Burckel, R. B. (1995). Probability theory. Berlin: de Gruyter.
[3]	Bertsekas, D. P., & Tsitsiklis, J. N. (1996). Neuro-dynamic programming. Athena Scientific: Nashua. · Zbl 0924.68163
[4]	Boyan, J. (1998). Learning evaluation functions for global optimization. PhD thesis, School of Computer Science Carnegie Mellon University.
[5]	Boyan, J. (1999). Least-squares temporal difference learning. In International conference machine learning. · Zbl 1014.68072
[6]	Bradtke, S. J., & Barto, A. G. (1996). Linear least-squares algorithms for temporal difference learning. Machine Learning, 22. · Zbl 1099.93534
[7]	Grünewälder, S., Hochreiter, S., & Obermayer, K. (2007). Optimality of lstd and its relation to mc. In Proceedings of the international joint conference of neural networks.
[8]	Jaakkola, T., Jordan, M. I., & Singh, S. P. (1994). On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation. · Zbl 0822.68095
[9]	Kearns, M., & Singh, S. (2000). Bias-variance error bounds for temporal difference updates. In Conference on computational learning theory.
[10]	Lehmann, E. L., & Casella, G. (1998). Theory of point estimation. Springer texts in stat. Berlin: Springer. · Zbl 0916.62017
[11]	Lugosi, G. (2006). Concentration-of-measure inequalities. Lecture notes.
[12]	Mannor, S., Simester, D., Sun, P., & Tsitsiklis, J. N. (2007). Bias and variance approximation in value function estimates. Management Science, 53. · Zbl 1232.90344
[13]	Singh, S., & Dayan, P. (1998). Analytical mean squared error curves for temporal difference learning. Machine Learning, 32. · Zbl 0901.68168
[14]	Singh, S., & Sutton, R. (1996). Reinforcement learning with replacing eligibility traces. Machine Learning, 22, 123-158. · Zbl 0843.68094
[15]	Sobel, M. J. (1982). The variance of discounted Markov decision processes. Journal of Applied Probability, 19. · Zbl 0503.90091
[16]	Stuart, A., & Ord, K. (1991). Kendall’s advanced theory of statistics (5th ed.). Sevenoaks: Edward Arnold. · Zbl 0727.62002
[17]	Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3.
[18]	Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: an introduction. Cambridge: MIT Press.
[19]	Watkins, C., & Dayan, P. (1992). Q-learning. Machine Learning, 8. · Zbl 0773.68062

This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.