A spiking neural network model of an actor-critic learning agent. (English) Zbl 1178.68457

Summary: The ability to adapt behavior to maximize reward through interactions with the environment is crucial for the survival of any higher organism. In the framework of reinforcement learning, temporal-difference learning algorithms provide an effective strategy for such goal-directed adaptation, but it is unclear to what extent these algorithms are compatible with neural computation. In this article, we present a spiking neural network model that implements actor-critic temporal-difference learning by combining local plasticity rules with a global reward signal. The network is capable of solving a nontrivial gridworld task with sparse rewards. We derive a quantitative mapping of plasticity parameters and synaptic weights to the corresponding variables in the standard algorithmic formulation and demonstrate that the network learns at a speed similar to that of its discrete-time counterpart and attains the same equilibrium performance.
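The standard algorithmic formulation the summary refers to can be sketched in a minimal tabular form: a critic maintains state values updated by the temporal-difference error, and an actor adjusts action preferences using that same global error signal. This is a generic illustration of actor-critic TD learning, not the paper's spiking implementation; the gridworld size, reward placement, learning rates, and discount factor below are assumed values for demonstration only.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5                                  # 5x5 gridworld (assumed size)
goal = (N - 1, N - 1)                  # single sparse reward at the far corner
actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

V = np.zeros((N, N))                   # critic: tabular state values
pref = np.zeros((N, N, 4))             # actor: action preferences per state
alpha, beta, gamma = 0.1, 0.1, 0.95    # learning rates and discount (assumed)

def step(s, a):
    """Deterministic grid transition; reward 1 only on reaching the goal."""
    r, c = s
    dr, dc = actions[a]
    s2 = (min(max(r + dr, 0), N - 1), min(max(c + dc, 0), N - 1))
    return s2, (1.0 if s2 == goal else 0.0)

for episode in range(500):
    s = (0, 0)
    for t in range(200):
        # softmax action selection from the actor's preferences
        p = np.exp(pref[s] - pref[s].max())
        p /= p.sum()
        a = rng.choice(4, p=p)
        s2, reward = step(s, a)
        done = s2 == goal
        # TD error: the single global reinforcement signal
        # shared by critic and actor updates
        delta = reward + (0.0 if done else gamma * V[s2]) - V[s]
        V[s] += alpha * delta          # critic update
        pref[s][a] += beta * delta     # actor update
        if done:
            break
        s = s2
```

After training, the learned values increase along paths toward the rewarded corner, and the actor's greedy policy follows that gradient. The paper's contribution is mapping exactly such updates onto local synaptic plasticity modulated by a global reward signal in a spiking network.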

MSC:

68T05 Learning and adaptive systems in artificial intelligence
Full Text: DOI
