
Probabilistic inference for determining options in reinforcement learning. (English) Zbl 1386.68127

Summary: Tasks that require many sequential decisions or complex solutions are hard to solve using conventional reinforcement learning algorithms. Based on the semi-Markov decision process (SMDP) setting and the option framework, we propose a model that aims to alleviate these concerns. Instead of learning a single monolithic policy, the agent learns a set of simpler sub-policies as well as the initiation and termination probabilities for each of those sub-policies. While existing option learning algorithms frequently require manual specification of components such as the sub-policies, we present an algorithm which infers all relevant components of the option framework from data. Furthermore, the proposed approach is based on parametric option representations and works well in combination with current policy search methods, which are particularly well suited for continuous real-world tasks. We present results on SMDPs with discrete as well as continuous state-action spaces. The results show that the presented algorithm can combine simple sub-policies to solve complex tasks and can improve learning performance on simpler tasks.
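To make the ingredients named in the summary concrete, the following is a minimal, illustrative Python sketch of the option framework: each option bundles a parametric sub-policy with per-state initiation and termination probabilities, and an SMDP-style rollout switches options whenever the active one terminates. The tabular softmax parameterization, the toy chain environment, and all names are assumptions chosen for illustration only; this is not the paper's inference algorithm, which learns these components from data.

```python
import numpy as np

rng = np.random.default_rng(0)


class Option:
    """One option: a sub-policy plus initiation and termination probabilities.

    All three components are placeholder parametric models (hypothetical
    tabular parameterization), standing in for the components the paper infers.
    """

    def __init__(self, n_states, n_actions):
        # Softmax sub-policy parameters, one logit vector per state.
        self.policy_logits = rng.normal(size=(n_states, n_actions))
        # Per-state probabilities of initiating / terminating this option.
        self.init_prob = rng.uniform(0.2, 0.8, size=n_states)
        self.term_prob = rng.uniform(0.05, 0.3, size=n_states)

    def act(self, s):
        # Sample an action from the softmax sub-policy in state s.
        logits = self.policy_logits[s]
        p = np.exp(logits - logits.max())
        p /= p.sum()
        return rng.choice(len(p), p=p)

    def terminates(self, s):
        # Terminate stochastically according to the termination probability.
        return rng.random() < self.term_prob[s]


def smdp_rollout(env_step, options, s0, horizon=50):
    """Roll out an SMDP episode: choose an option (weighted by its initiation
    probability in the current state), follow its sub-policy until it
    terminates, then choose again."""
    s, active, trajectory = s0, None, []
    for _ in range(horizon):
        if active is None or active.terminates(s):
            weights = np.array([o.init_prob[s] for o in options])
            active = options[rng.choice(len(options), p=weights / weights.sum())]
        a = active.act(s)
        s_next, r = env_step(s, a)
        trajectory.append((s, a, r))
        s = s_next
    return trajectory


# Toy chain environment used only to exercise the rollout (not from the paper).
def chain_step(s, a, n_states=10):
    s_next = int(np.clip(s + (1 if a == 1 else -1), 0, n_states - 1))
    return s_next, float(s_next == n_states - 1)


options = [Option(n_states=10, n_actions=2) for _ in range(3)]
print(len(smdp_rollout(chain_step, options, s0=0)))
```

In the paper's setting the sub-policy parameters, initiation probabilities, and termination probabilities would be estimated jointly from data rather than drawn at random as in this sketch.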

MSC:

68T05 Learning and adaptive systems in artificial intelligence
60K15 Markov renewal processes, semi-Markov processes
68T40 Artificial intelligence for robotics

Software:

PRMLT
