
Reward prediction errors, not sensory prediction errors, play a major role in model selection in human reinforcement learning. (English) Zbl 1525.91148

Summary: Model-based reinforcement learning enables an agent to learn in changing environments and tasks by optimizing its actions according to predicted states and outcomes. The brain has been proposed to implement a similar mechanism, but exactly how it selects an appropriate internal model of the environment it faces has remained unclear. Here, we investigated the model selection algorithm used by the human brain during a reinforcement learning task. One prominent theory holds that model selection in the brain is driven by sensory prediction errors; we compared this theory with the alternative possibility that internal models are selected using reward prediction errors. To compare the two theories, we devised an experiment that switches from a first-order to a second-order Markov decision process and thereby signals the environmental change through either reward or sensory prediction errors. We tested two representative computational models driven by these different prediction errors: a sensory-prediction-error-driven Bayesian algorithm, which has been discussed as a representative internal model selection algorithm in animal reinforcement learning tasks, and a reward-prediction-error-driven policy gradient algorithm. Comparing the simulations of these two computational models with human reinforcement learning behavior, the model fitting results favor the policy gradient algorithm over the Bayesian algorithm. This suggests that the human brain uses reward prediction errors to select an appropriate internal model during reinforcement learning.
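To make the contrast between the two arbitration schemes concrete, the following minimal Python sketch illustrates the general form of each update rule. It is an illustrative assumption, not the authors' task or fitted models: the toy transition models (`model_A_likelihood`, `model_B_likelihood`), the reward scheme, and all parameters are hypothetical. The Bayesian scheme reweights candidate models by how well they predicted the observed next state (a sensory/state prediction error signal), while the policy-gradient scheme adjusts the probability of committing to a model via a REINFORCE-style update scaled by the reward prediction error (cf. Williams, 1992 [25]).

```python
# Hypothetical sketch contrasting the two model-selection schemes in the summary.
import numpy as np

rng = np.random.default_rng(0)

# Two candidate internal models of the transition s -> s'
# (stand-ins for the first- and second-order Markov models in the task).
def model_A_likelihood(s_next, s):
    # "next state repeats the current one" with prob 0.8
    return 0.8 if s_next == s else 0.2

def model_B_likelihood(s_next, s):
    # "next state flips" with prob 0.8
    return 0.8 if s_next != s else 0.2

# 1) Sensory-prediction-error-driven Bayesian model selection:
#    reweight the posterior over models by each model's prediction of s'.
def bayesian_update(posterior, s, s_next):
    lik = np.array([model_A_likelihood(s_next, s),
                    model_B_likelihood(s_next, s)])
    posterior = posterior * lik
    return posterior / posterior.sum()

# 2) Reward-prediction-error-driven policy-gradient model selection:
#    a single parameter theta sets the probability of committing to model A,
#    updated by the score function scaled by the reward prediction error.
def policy_gradient_update(theta, baseline, chose_A, reward, alpha=0.3, beta=0.1):
    p_A = 1.0 / (1.0 + np.exp(-theta))           # sigmoid "model policy"
    grad_log_pi = (1.0 - p_A) if chose_A else -p_A
    rpe = reward - baseline                      # reward prediction error
    theta += alpha * rpe * grad_log_pi
    baseline += beta * rpe                       # running reward baseline
    return theta, baseline

# Toy interaction loop: the true dynamics (and reward) favour model B.
posterior, theta, baseline, s = np.array([0.5, 0.5]), 0.0, 0.0, 0
for t in range(200):
    s_next = 1 - s if rng.random() < 0.8 else s
    posterior = bayesian_update(posterior, s, s_next)

    chose_A = rng.random() < 1.0 / (1.0 + np.exp(-theta))
    reward = 0.0 if chose_A else 1.0             # acting on model B is rewarded
    theta, baseline = policy_gradient_update(theta, baseline, chose_A, reward)
    s = s_next

print("Bayesian posterior P(model B):", posterior[1])
print("Policy-gradient   P(model B):", 1.0 - 1.0 / (1.0 + np.exp(-theta)))
```

In this toy setting both schemes converge on model B, but they rely on different teaching signals: the Bayesian update needs only state observations, whereas the policy-gradient update needs only rewards, which is what allows the switching experiment described above to dissociate them.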

MSC:

91E40 Memory and learning in psychology
91-05 Experimental work for problems pertaining to game theory, economics, and finance
Full Text: DOI

References:

[1] Bellman, R., Dynamic programming, Science, 153, 3731, 34-37 (1966)
[2] Bertin, M.; Schweighofer, N.; Doya, K., Multiple model-based reinforcement learning explains dopamine neuronal activity, Neural Networks, 20, 6, 668-675 (2007) · Zbl 1119.68376
[3] Daw, N. D.; Dayan, P., The algorithmic anatomy of model-based evaluation, Philosophical Transactions of the Royal Society B: Biological Sciences, 369, 1655, Article 20130478 pp. (2014)
[4] Daw, N. D.; Gershman, S. J.; Seymour, B.; Dayan, P.; Dolan, R. J., Model-based influences on humans’ choices and striatal prediction errors, Neuron, 69, 6, 1204-1215 (2011)
[5] Doll, B. B.; Simon, D. A.; Daw, N. D., The ubiquity of model-based reinforcement learning, Current Opinion in Neurobiology, 22, 6, 1075-1081 (2012)
[6] Donoso, M.; Collins, A. G.; Koechlin, E., Foundations of human reasoning in the prefrontal cortex, Science, 344, 1481-1486 (2014)
[7] Doya, K.; Samejima, K.; Katagiri, K.-i.; Kawato, M., Multiple model-based reinforcement learning, Neural Computation, 14, 6, 1347-1369 (2002) · Zbl 0997.93037
[8] Fermin, A. S.; Yoshida, T.; Yoshimoto, J.; Ito, M.; Tanaka, S. C.; Doya, K., Model-based action planning involves cortico-cerebellar and basal ganglia networks, Scientific Reports, 6, 1, 1-14 (2016)
[9] Gläscher, J.; Daw, N. D.; Dayan, P.; O’Doherty, J., States versus rewards: dissociable neural prediction error signals underlying model-based and model-free reinforcement learning, Neuron, 66, 4, 585-595 (2010)
[10] Haruno, M.; Wolpert, D. M.; Kawato, M., Mosaic model for sensorimotor learning and control, Neural Computation, 13, 10, 2201-2220 (2001) · Zbl 0984.68151
[11] Kaelbling, L. P.; Littman, M. L.; Cassandra, A. R., Planning and acting in partially observable stochastic domains, Artificial Intelligence, 101, 1-2, 99-134 (1998) · Zbl 0908.68165
[12] Wunderlich, K.; Smittenaar, P.; Dolan, R. J., Dopamine enhances model-based over model-free choice behavior, Neuron, 75, 3, 418-424 (2012)
[13] Lee, S. W.; Shimojo, S.; O’Doherty, J. P., Neural computations underlying arbitration between model-based and model-free learning, Neuron, 81, 3, 687-699 (2014)
[14] Littman, M. L.; Cassandra, A. R.; Kaelbling, L. P., Learning policies for partially observable environments: Scaling up, (Machine learning proceedings 1995 (1995), Morgan Kaufmann), 362-370
[15] Araya, M.; Buffet, O.; Thomas, V.; Charpillet, F., A POMDP extension with belief-dependent rewards, Advances in Neural Information Processing Systems, 23 (2010)
[16] Redgrave, P.; Prescott, T. J.; Gurney, K., The basal ganglia: a vertebrate solution to the selection problem?, Neuroscience, 89, 4, 1009-1023 (1999)
[17] Peters, J.; Schaal, S., Policy gradient methods for robotics, (IEEE/RSJ international conference on intelligent robots and systems (2006), IEEE)
[18] Russek, E. M.; Momennejad, I.; Botvinick, M. M.; Gershman, S. J.; Daw, N. D., Predictive representations can link model-based reinforcement learning to model-free mechanisms, PLoS Computational Biology, 13, 9, Article e1005768 pp. (2017)
[19] Singh, S. P., Transfer of learning by composing solutions of elemental sequential tasks, Machine Learning, 8, 3, 323-339 (1992) · Zbl 0772.68073
[20] Sugimoto, N.; Haruno, M.; Doya, K.; Kawato, M., MOSAIC for multiple-reward environments, Neural Computation, 24, 3, 577-606 (2012) · Zbl 1238.68132
[21] Sutton, R. S., Integrated architectures for learning, planning, and reacting based on approximating dynamic programming, (Machine learning proceedings 1990 (1990), Morgan Kaufmann), 216-224
[22] Sutton, R. S.; Barto, A. G., Introduction to reinforcement learning, Vol. 135 (1998), MIT Press: Cambridge
[23] Sutton, R. S.; McAllester, D.; Singh, S.; Mansour, Y., Policy gradient methods for reinforcement learning with function approximation, Advances in Neural Information Processing Systems, 12 (1999)
[24] Todd, M.; Niv, Y.; Cohen, J. D., Learning to use working memory in partially observable environments through dopaminergic reinforcement, Advances in Neural Information Processing Systems, 21 (2008)
[25] Williams, R. J., Simple statistical gradient-following algorithms for connectionist reinforcement learning, Machine Learning, 8, 3, 229-256 (1992) · Zbl 0772.68076
[26] Schultz, W.; Dayan, P.; Montague, P. R., A neural substrate of prediction and reward, Science, 275, 5306, 1593-1599 (1997)