Reinforcement learning for a biped robot based on a CPG-actor-critic method. (English) Zbl 1129.68067

Summary: Animals’ rhythmic movements, such as locomotion, are considered to be controlled by neural circuits called Central Pattern Generators (CPGs), which generate oscillatory signals. Motivated by this biological mechanism, studies have been conducted on rhythmic movements controlled by CPGs. As an autonomous learning framework for a CPG controller, we propose in this article a reinforcement learning method we call the “CPG-actor-critic” method. This method introduces a new architecture into the actor, and its training is roughly based on a recently presented stochastic policy gradient algorithm. We apply this method to the problem of automatically acquiring a controller for a biped robot. Computer simulations show that training of the CPG can be successfully performed by our method, thus allowing the biped robot not only to walk stably but also to adapt to environmental changes.
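
The summary does not spell out the controller, but CPG controllers of this kind are commonly built from mutually inhibiting Matsuoka neurons (reference [26] below). As a purely illustrative aid, the following is a minimal Python sketch of such a two-neuron oscillator; the function name matsuoka_step, the explicit Euler integrator, and all parameter values are assumptions of this sketch, not details taken from the reviewed paper. In the CPG-actor-critic setting, the actor would supply inputs to such an oscillator and its parameters would be tuned by a stochastic policy gradient.

    # Hypothetical sketch (not from the reviewed paper): a two-neuron Matsuoka
    # oscillator, the mutual-inhibition CPG model of reference [26].
    import numpy as np

    def matsuoka_step(u, v, dt, tau=0.05, tau_adapt=0.6, beta=2.5, w=2.0, c=1.0,
                      feedback=(0.0, 0.0)):
        """Advance two mutually inhibiting Matsuoka neurons by one Euler step.

        u: membrane states, shape (2,); v: adaptation states, shape (2,).
        Returns the updated (u, v) and the rectified outputs y = max(u, 0).
        """
        y = np.maximum(u, 0.0)
        # Neuron 0 is inhibited by y[1] and vice versa (mutual inhibition).
        du = (-u - w * y[::-1] - beta * v + c + np.asarray(feedback)) / tau
        dv = (-v + y) / tau_adapt
        return u + dt * du, v + dt * dv, y

    if __name__ == "__main__":
        dt, steps = 1e-3, 5000
        u, v = np.array([0.1, 0.0]), np.zeros(2)  # slight asymmetry starts the rhythm
        trace = []
        for _ in range(steps):
            u, v, y = matsuoka_step(u, v, dt)
            trace.append(y[0] - y[1])  # antisymmetric output, e.g. a joint torque command
        print("output range:", min(trace), max(trace))

Roughly, these illustrative values satisfy Matsuoka's conditions for a limit cycle (mutual inhibition w larger than 1 + tau/tau_adapt but smaller than 1 + beta), so the two outputs alternate rather than settling to a fixed point; sensory input entering through the feedback term is what lets such an oscillator entrain to the robot's dynamics.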

MSC:

68T05 Learning and adaptive systems in artificial intelligence
68T40 Artificial intelligence for robotics
Full Text: DOI

References:

[1] Amari, S., Natural gradient works efficiently in learning, Neural Computation, 10, 2, 251-276 (1998)
[2] Asano, F.; Yamakita, M.; Kamamichi, N.; Luo, Z.-W., A novel gait generation for biped walking robots based on mechanical energy constraint, IEEE Transactions on Robotics and Automation, 20, 3, 565-573 (2004)
[3] Bentivegna, D. C., Ude, A., Atkeson, C. G., & Cheng, G. (2002). Humanoid robot learning and game playing using PC-based vision. In Proceedings of the 2002 IEEE/RSJ international conference on intelligent robots and systems
[4] Bradtke, S. J.; Barto, A. G., Linear least-squares algorithms for temporal difference learning, Machine Learning, 22, 33-57 (1996) · Zbl 1099.93534
[5] Cruse, H.; Kindermann, T.; Schumm, M.; Dean, J.; Schmitz, J., Walknet — A biologically inspired network to control six-legged walking, Neural Networks, 11, 7-8, 1435-1447 (1998)
[6] Doya, K.; Yoshizawa, S., Adaptive synchronization of neural and physical oscillators, Advances in Neural Information Processing Systems, 4, 109-116 (1992)
[7] Ekeberg, Ö., A combined neuronal and mechanical model of fish swimming, Biological Cybernetics, 69, 363-374 (1993) · Zbl 0780.92007
[8] Fukuoka, Y.; Kimura, H.; Cohen, A. H., Adaptive dynamic walking of a quadruped robot on irregular terrain based on biological concepts, International Journal of Robotics Research, 22, 3-4, 187-202 (2003)
[9] Grillner, S.; Wallen, P.; Brodin, L.; Lansner, A., Neuronal network generating locomotor behavior in lamprey: Circuitry, transmitters, membrane properties and simulations, Annual Review of Neuroscience, 14, 169-199 (1991)
[10] Hirai, K., Hirose, M., Haikawa, Y., & Takenaka, T. (1998). The development of Honda humanoid robot. In Proceedings of the 1998 IEEE international conference on robotics & automation
[11] Hitomi, K., Shibata, T., Nakamura, Y., & Ishii, S. (2005). On-line learning of a feedback controller for quasi-passive-dynamic walking by a stochastic policy gradient method. In IEEE/RSJ international conference on intelligent robots and systems
[12] Ijspeert, A. J.; Cabelguen, J.-M., Gait transition from swimming to walking: Investigation of salamander locomotion control using nonlinear oscillators (2003)
[13] Inada, H., & Ishii, K. (2003). Behavior generation of bipedal robot using central pattern generator (CPG) (1st report: CPG parameters searching method by genetic algorithm). In Proceedings of international conference on intelligent robots and systems, Vol. 3
[14] Inamura, T.; Nakamura, Y.; Toshima, I., Embodied symbol emergence based on mimesis theory, International Journal of Robotics Research, 23, 4, 363-377 (2004)
[15] Ishii, S.; Yoshida, W.; Yoshimoto, J., Control of exploitation-exploration meta-parameter in reinforcement learning, Neural Networks, 15, 4, 665-687 (2002)
[16] Itoh, Y., Taki, K., Kato, S., & Itoh, H. (2004). A stochastic optimization method of CPG-based motion control for humanoid locomotion. In IEEE conference on robotics, automation and mechatronics
[17] Kaelbling, L. P.; Littman, M. L.; Moore, A. W., Reinforcement learning: A survey, Journal of Artificial Intelligence Research, 4, 237-285 (1996)
[18] Kakade, S., A natural policy gradient, (Advances in neural information processing systems, Vol. 14 (2001)), 1531-1538
[19] Keller, P. W., Mannor, S., & Precup, D. (2006). Automatic basis function construction for approximate dynamic programming and reinforcement learning. In The 23rd international conference on machine learning
[20] Kimura, H., & Kobayashi, S. (1998). An analysis of actor/critic algorithms using eligibility traces: Reinforcement learning with imperfect value function. In 15th international conference on machine learning
[21] Konda, V. R.; Tsitsiklis, J. N., Actor-critic algorithms, SIAM Journal on Control and Optimization, 42, 4, 1143-1166 (2003) · Zbl 1049.93095
[22] Kotosaka, S., & Schaal, S. (2000). Synchronized robot drumming by neural oscillator. In The international symposium on adaptive motion of animals and machines
[23] Lagoudakis, M. G., Parr, R., & Littman, M. L. (2002). Least-squares methods in reinforcement learning for control. In Methods and applications of artificial intelligence, second Hellenic conference on AI, SETN · Zbl 1065.68608
[24] Lewis, M., Fagg, A., & Bekey, G. (1993). Genetic algorithms for gait synthesis in a hexapod robot. In Recent trends in mobile robots
[25] Lim, H.; Yamamoto, Y.; Takanishi, A., Stabilization control for biped follow walking, Advanced Robotics, 16, 4, 361-380 (2002)
[26] Matsuoka, K., Sustained oscillations generated by mutually inhibiting neurons with adaptation, Biological Cybernetics, 52, 367-376 (1985) · Zbl 0574.92013
[27] McGeer, T., Passive dynamic walking, International Journal of Robotics Research, 9, 2, 62-82 (1990)
[28] Menache, I.; Mannor, S.; Shimkin, N., Basis function adaptation in temporal difference reinforcement learning, Annals of Operations Research, 134, 1, 215-238 (2005) · Zbl 1075.90073
[29] Miyashita, K.; Ok, S.; Hase, K., Evolutionary generation of human-like bipedal locomotion, Mechatronics, 13, 791-807 (2003)
[30] Morimoto, J.; Atkeson, C. G., Minimax differential dynamic programming: An application to robust biped walking, Advances in Neural Information Processing Systems, 15, 1539-1546 (2003)
[31] Morimoto, J.; Doya, K., Acquisition of stand-up behavior by a real robot using hierarchical reinforcement learning, Robotics and Autonomous Systems, 36, 37-51 (2001) · Zbl 1014.68179
[32] Nakamura, Y., Sato, M., & Ishii, S. (2003). Reinforcement learning for biped robot. In 2nd international symposium on adaptive motion of animals and machines · Zbl 1013.68827
[33] Nakanishi, J.; Morimoto, J.; Endo, G.; Cheng, G.; Schaal, S.; Kawato, M., Learning from demonstration and adaptation of biped locomotion, Robotics and Autonomous Systems, 47, 79-91 (2004)
[34] Nishii, J., Legged insects select the optimal locomotor pattern based on the energetic cost, Biological Cybernetics, 83, 5 (2000)
[35] Ogihara, N.; Yamazaki, N., Generation of human bipedal locomotion by a bio-mimetic neuro-musculo-skeletal model, Biological Cybernetics, 84, 1-11 (2001)
[36] Park, H.; Amari, S.; Fukumizu, K., Adaptive natural gradient learning algorithms for various stochastic models, Neural Networks, 13, 755-764 (2000)
[37] Pearlmutter, B. A., Gradient calculations for dynamic recurrent neural networks: A survey, IEEE Transactions on Neural Networks, 6, 5, 1212-1228 (1995)
[38] Peters, J., Vijayakumar, S., & Schaal, S. (2003). Reinforcement learning for humanoid robotics. In Third IEEE international conference on humanoid robotics 2003
[39] Peters, J., Vijayakumar, S., & Schaal, S. (2005). Natural actor-critic. In The 16th European conference on machine learning
[40] Ratitch, B., & Precup, D. (2004). Sparse distributed memories for on-line value-based reinforcement learning. In The 15th European conference on machine learning · Zbl 1132.68586
[41] Sato, M., A real time learning algorithm for recurrent analog neural networks, Biological Cybernetics, 62, 237-241 (1990) · Zbl 0686.92010
[42] Sato, M.; Ishii, S., Reinforcement learning based on on-line EM algorithm, Advances in Neural Information Processing Systems, 11, 1052-1058 (1999)
[43] Schaal, S., Peters, J., Nakanishi, J., & Ijspeert, A. (2004). Learning movement primitives. In International symposium on robotics research
[44] Sutton, R. S.; Barto, A. G., Reinforcement learning: An introduction (1998), MIT Press
[45] Sutton, R. S.; McAllester, D.; Singh, S.; Mansour, Y., Policy gradient methods for reinforcement learning with function approximation, (Advances in neural information processing systems, Vol. 12 (2000)), 1057-1063
[46] Taga, G.; Yamaguchi, Y.; Shimizu, H., Self-organized control of bipedal locomotion by neural oscillators in unpredictable environment, Biological Cybernetics, 65, 147-159 (1991) · Zbl 0734.92005
[47] Tedrake, R., Zhang, T. W., & Seung, H. S. (2004). Stochastic policy gradient reinforcement learning on a simple 3D biped. In Proceedings of the IEEE international conference on intelligent robots and systems
[48] Thrun, S. B., The role of exploration in learning control with neural networks, (White, D. A.; Sofge, D. A., Handbook of intelligent control: Neural, fuzzy and adaptive approaches (1992), Van Nostrand Reinhold: Florence, Kentucky)
[49] Wadden, T.; Ekeberg, O., A neuro-mechanical model of legged locomotion: Single leg control, Biological Cybernetics, 79, 2, 161-173 (1998)
[50] White, D. A.; Sofge, D. A., Applied learning: Optimal control for manufacturing, (Handbook of intelligent control: Neural, fuzzy and adaptive approaches (1992), Van Nostrand Reinhold), 259-282
[51] Williams, R. J., Simple statistical gradient-following algorithms for connectionist reinforcement learning, Machine Learning, 8, 229-256 (1992) · Zbl 0772.68076
[52] Williamson, M. M., Neural control of rhythmic arm movements, Neural Networks, 11, 7-8, 1379-1394 (1998)
[53] Yoshimoto, J., Ishii, S., & Sato, M. (2000). On-line EM reinforcement learning. In IEEE-INNS-ENNS international joint conference on neural networks, Vol. 3
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases the data have been complemented or enhanced by data from zbMATH Open. This list attempts to reflect the references in the original paper as accurately as possible, without claiming completeness or perfect matching.