Abstract
This paper introduces the Crossmodal Attentive Skill Learner (CASL), integrated with the recently introduced Asynchronous Advantage Option-Critic architecture [Harb et al. in When waiting is not an option: learning options with a deliberation cost. arXiv preprint arXiv:1709.04571, 2017] to enable hierarchical reinforcement learning across multiple sensory inputs. Agents trained using our approach learn to attend to their various sensory modalities (e.g., audio, video) at the appropriate moments, thereby executing actions based on multiple sensory streams without reliance on supervisory data. We demonstrate empirically that the sensory attention mechanism anticipates and identifies useful latent features, while filtering irrelevant sensor modalities during execution. Further, we provide concrete examples in which the approach not only improves performance in a single task, but also accelerates transfer to new tasks. We modify the Arcade Learning Environment [Bellemare et al. in J Artif Intell Res 47:253–279, 2013] to support audio queries (ALE-audio code available at https://github.com/shayegano/Arcade-Learning-Environment), and conduct evaluations of crossmodal learning in the Atari 2600 games H.E.R.O. and Amidar. Finally, building on the recent work of Babaeizadeh et al. [in: International conference on learning representations (ICLR), 2017], we open-source a fast hybrid CPU–GPU implementation of CASL (CASL code available at https://github.com/shayegano/CASL).
Notes
Note that HRL with a single option is equivalent to standard RL with primitive actions. Therefore, a single-option CASL corresponds to A3C [30], but with the crossmodal attention mechanism.
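To make the crossmodal attention idea concrete, the following is a minimal illustrative sketch (not the paper's exact architecture, which couples attention with recurrent option-critic networks): per-modality feature vectors, e.g. audio and video embeddings, are scored by a learned linear layer, the scores are normalized via softmax into attention weights, and the weighted sum forms the fused representation. The parameters `W` and `b` and the function name are hypothetical, for illustration only.

```python
import numpy as np

def crossmodal_attention(features, W, b):
    """Illustrative crossmodal attention over modality features.

    features: list of per-modality feature vectors (e.g., [audio, video]).
    W, b: hypothetical learned scoring parameters (one row/bias per modality).
    Returns the attention-weighted fusion and the attention weights.
    """
    # Score each modality m: s_m = W_m . f_m + b_m
    scores = np.array([W[m] @ f + b[m] for m, f in enumerate(features)])
    # Softmax over modalities yields weights that sum to 1
    exp = np.exp(scores - scores.max())
    alpha = exp / exp.sum()
    # Fused representation: attention-weighted sum of modality features
    fused = sum(a * f for a, f in zip(alpha, features))
    return fused, alpha
```

In the full architecture, the fused vector would feed the downstream recurrent policy/option networks; here the sketch only shows how attention weights let the agent emphasize one sensory stream over another at a given timestep.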
References
Al-Shedivat, M., Bansal, T., Burda, Y., Sutskever, I., Mordatch, I., & Abbeel, P. (2018). Continuous adaptation via meta-learning in nonstationary and competitive environments. In International Conference on Learning Representations (ICLR).
Alvis, C. D., Ott, L., & Ramos, F. (2017). Online learning for scene segmentation with laser-constrained CRFs. International Conference on Robotics and Automation (ICRA), 4639–4643.
Andreas, J., Klein, D., & Levine, S. (2016). Modular multitask reinforcement learning with policy sketches. arXiv preprint arXiv:1611.01796.
Ba, J., Mnih, V., & Kavukcuoglu, K. (2014). Multiple object recognition with visual attention. arXiv preprint arXiv:1412.7755.
Babaeizadeh, M., Frosio, I., Tyree, S., Clemons, J., & Kautz, J. (2017). Reinforcement learning through asynchronous advantage actor-critic on a GPU. In International Conference on Learning Representations (ICLR).
Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
Beal, M. J., Attias, H., & Jojic, N. (2002). Audio-video sensor fusion with probabilistic graphical models. European Conference on Computer Vision (ECCV), 736–750.
Bellemare, M. G., Naddaf, Y., Veness, J., & Bowling, M. (2013). The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47, 253–279.
Bengio, S. (2002). An asynchronous hidden Markov model for audio-visual speech recognition. Advances in Neural Information Processing Systems (NIPS), 1237–1244.
Cadena, C., & Košecká, J. (2014). Semantic segmentation with heterogeneous sensor coverages. International Conference on Robotics and Automation (ICRA), 2639–2645.
Caglayan, O., Barrault, L., & Bougares, F. (2016). Multimodal attention for neural machine translation. arXiv preprint arXiv:1609.03976.
Carrasco, M. (2011). Visual attention: The past 25 years. Vision Research, 51(13), 1484–1525.
Chambers, A. D., Scherer, S., Yoder, L., Jain, S., Nuske, S. T., & Singh, S. (2014). Robust multi-sensor fusion for micro aerial vehicle navigation in GPS-degraded/denied environments. In American Control Conference (ACC).
Da Silva, B., Konidaris, G., & Barto, A. (2012). Learning parameterized skills. arXiv preprint arXiv:1206.6398.
Eitel, A., Springenberg, J. T., Spinello, L., Riedmiller, M., & Burgard, W. (2015). Multimodal deep learning for robust RGB-D object recognition. In International Conference on Intelligent Robots and Systems (IROS).
Harb, J., Bacon, P.-L., Klissarov, M., & Precup, D. (2017). When waiting is not an option: Learning options with a deliberation cost. arXiv preprint arXiv:1709.04571.
Hausknecht, M., & Stone, P. (2015). Deep recurrent Q-learning for partially observable MDPs. arXiv preprint arXiv:1507.06527.
He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep residual learning for image recognition. CoRR, arXiv:1512.03385.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.
Kaelbling, L. P., Littman, M. L., & Cassandra, A. R. (1998). Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1), 99–134.
Kiros, R., Salakhutdinov, R., & Zemel, R. S. (2014). Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539.
Konidaris, G., & Barto, A. G. (2009). Skill discovery in continuous reinforcement learning domains using skill chaining. Advances in Neural Information Processing Systems (NIPS), 1015–1023.
Kulkarni, T. D., Narasimhan, K., Saeedi, A., & Tenenbaum, J. (2016). Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. Advances in Neural Information Processing Systems (NIPS), 3675–3683.
Lafferty, J. D., McCallum, A., & Pereira, F. C. N. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. International Conference on Machine Learning (ICML), 282–289.
Leong, Y. C., Radulescu, A., Daniel, R., DeWoskin, V., & Niv, Y. (2017). Dynamic interaction between reinforcement learning and attention in multidimensional environments. Neuron, 93(2), 451–463.
Luong, M.-T., Pham, H., & Manning, C. D. (2015). Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.
Lynen, S., Achtelik, M. W., Weiss, S., Chli, M., & Siegwart, R. (2013). A robust and modular multi-sensor fusion approach applied to MAV navigation. International Conference on Intelligent Robots and Systems (IROS), 3923–3929.
Machado, M. C., Bellemare, M. G., & Bowling, M. (2017). A laplacian framework for option discovery in reinforcement learning. arXiv preprint arXiv:1703.00956.
Mackintosh, N. J. (1975). A theory of attention: Variations in the associability of stimuli with reinforcement. Psychological Review, 82(4), 276.
Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., et al. (2016). Asynchronous methods for deep reinforcement learning. International Conference on Machine Learning (ICML), 1928–1937.
Mnih, V., Heess, N., Graves, A., et al. (2014). Recurrent models of visual attention. Advances in Neural Information Processing Systems (NIPS), 2204–2212.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529–533.
Nachum, O., Gu, S. S., Lee, H., & Levine, S. (2018). Data-efficient hierarchical reinforcement learning. Advances in Neural Information Processing Systems (NIPS), 3306–3317.
Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., & Ng, A. Y. (2011). Multimodal deep learning. International Conference on Machine Learning (ICML), 689–696.
Niv, Y., Daniel, R., Geana, A., Gershman, S. J., Leong, Y. C., Radulescu, A., et al. (2015). Reinforcement learning in multidimensional environments relies on attention mechanisms. Journal of Neuroscience, 35(21), 8145–8157.
Nobili, S., Camurri, M., Barasuol, V., Focchi, M., Caldwell, D. G., Semini, C., & Fallon, M. (2017). Heterogeneous sensor fusion for accurate state estimation of dynamic legged robots. In Robotics: Science and Systems (RSS).
Pearce, J. M., & Hall, G. (1980). A model for pavlovian learning: Variations in the effectiveness of conditioned but not of unconditioned stimuli. Psychological Review, 87(6), 532.
Pearce, J. M., & Mackintosh, N. J. (2010). Two theories of attention: A review and a possible integration. Attention and Associative Learning: From Brain to Behaviour, 11–39.
Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., et al. (2017). Mastering the game of Go without human knowledge. Nature, 550(7676), 354–359.
Sorokin, I., Seleznev, A., Pavlov, M., Fedorov, A., & Ignateva, A. (2015). Deep attention recurrent Q-network. arXiv preprint arXiv:1512.01693.
Srivastava, N., & Salakhutdinov, R. (2014). Multimodal learning with deep Boltzmann machines. Journal of Machine Learning Research, 15, 2949–2980.
Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge, MA: MIT Press.
Sutton, R. S., Precup, D., & Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1–2), 181–211.
Vezhnevets, A. S., Osindero, S., Schaul, T., Heess, N., Jaderberg, M., Silver, D., & Kavukcuoglu, K. (2017). Feudal networks for hierarchical reinforcement learning. arXiv preprint arXiv:1703.01161.
Vinyals, O., Kaiser, Ł., Koo, T., Petrov, S., Sutskever, I., & Hinton, G. (2015). Grammar as a foreign language. Advances in Neural Information Processing Systems (NIPS), 2773–2781.
Yang, Z., Yang, D., Dyer, C., He, X., Smola, A. J., & Hovy, E. H. (2016). Hierarchical attention networks for document classification. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 1480–1489.
Yeung, S., Russakovsky, O., Jin, N., Andriluka, M., Mori, G., & Fei-Fei, L. (2015). Every moment counts: Dense detailed labeling of actions in complex videos. International Journal of Computer Vision, 1–15.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Jason Pazis: This work was done prior to Jason Pazis's involvement with Amazon, and does not reflect the views of the Amazon company.
This work was supported by Boeing Research and Technology, ONR MURI Grant N000141110688, BRC Grant N000141712072, IBM (as part of the MIT-IBM Watson AI Lab initiative), and the AWS Machine Learning Research Awards program. We also thank the three anonymous reviewers for their helpful suggestions.
About this article
Cite this article
Kim, DK., Omidshafiei, S., Pazis, J. et al. Crossmodal attentive skill learner: learning in Atari and beyond with audio–video inputs. Auton Agent Multi-Agent Syst 34, 16 (2020). https://doi.org/10.1007/s10458-019-09439-5