
Transductive zero-shot action recognition by word-vector embedding. (English) Zbl 1455.68226

Summary: The number of categories for action recognition is growing rapidly, and it has become increasingly hard to label sufficient training data for learning conventional models for all categories. Instead of collecting ever more data and labelling it exhaustively for all categories, an attractive alternative is “zero-shot learning” (ZSL). To that end, in this study we construct a mapping between visual features and a semantic descriptor of each action category, allowing new categories to be recognised in the absence of any visual training data. Existing ZSL studies focus primarily on still images and attribute-based semantic representations. In this work, we explore word-vectors as the shared semantic space in which to embed videos and category labels for ZSL action recognition. This is a more challenging problem than existing ZSL of still images and/or attributes, because the mapping between the space-time features of action videos and the semantic space is more complex and harder to learn in a way that generalises across the cross-category domain shift. To solve this generalisation problem in ZSL action recognition, we investigate a series of synergistic strategies that improve on the standard ZSL pipeline. Most of these strategies are transductive in nature, meaning they have access to the (unlabelled) test data during the training phase. First, we significantly enhance the semantic-space mapping by proposing manifold-regularized regression and data augmentation strategies. Second, we evaluate two existing post-processing strategies (transductive self-training and hubness correction) and show that they are complementary. We evaluate our model extensively on a wide range of human action datasets, including HMDB51, UCF101 and Olympic Sports, and event datasets, including CCV and TRECVID MED 13. The results demonstrate that our approach achieves state-of-the-art zero-shot action recognition performance with a simple and efficient pipeline, and without supervised annotation of attributes. Finally, we present an in-depth analysis of why and when zero-shot recognition works, including demonstrating the ability to predict cross-category transferability in advance.
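The pipeline the summary describes can be illustrated with a small sketch. The Python code below is only a minimal, hypothetical illustration under stated assumptions: it uses plain ridge regression as a stand-in for the paper's manifold-regularized regression, random arrays in place of real video features and word2vec label embeddings, and a k-means-like prototype refinement as one plausible reading of the transductive self-training step; none of the function names, dimensions, or hyper-parameters come from the paper.

```python
# Minimal sketch of a word-vector zero-shot action recognition pipeline.
# Assumptions: ridge regression replaces the paper's manifold-regularized
# regression; features and word vectors are random placeholders.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics.pairwise import cosine_similarity


def train_visual_to_semantic(X_train, Z_train, alpha=1.0):
    """Regress video features X (n x d) onto the word vectors Z (n x k)
    of their seen-class labels."""
    reg = Ridge(alpha=alpha)
    reg.fit(X_train, Z_train)
    return reg


def zero_shot_predict(reg, X_test, unseen_class_vectors):
    """Project test videos into the word-vector space and match each one
    to the nearest unseen-class label embedding."""
    Z_pred = reg.predict(X_test)                              # (m x k)
    sims = cosine_similarity(Z_pred, unseen_class_vectors)    # (m x C)
    return sims.argmax(axis=1), Z_pred


def self_train_prototypes(Z_pred, unseen_class_vectors, n_iter=5):
    """Hypothetical transductive refinement: repeatedly reassign test
    projections to prototypes and move each prototype to the mean of its
    assigned projections."""
    prototypes = unseen_class_vectors.copy()
    for _ in range(n_iter):
        assign = cosine_similarity(Z_pred, prototypes).argmax(axis=1)
        for c in range(prototypes.shape[0]):
            members = Z_pred[assign == c]
            if len(members):
                prototypes[c] = members.mean(axis=0)
    return prototypes


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, k = 512, 300                      # hypothetical feature / word-vector dims
    X_train = rng.normal(size=(200, d))  # seen-class video features
    Z_train = rng.normal(size=(200, k))  # word vectors of seen labels
    X_test = rng.normal(size=(50, d))    # unseen-class test videos
    unseen = rng.normal(size=(5, k))     # word vectors of unseen labels

    reg = train_visual_to_semantic(X_train, Z_train)
    labels, Z_pred = zero_shot_predict(reg, X_test, unseen)
    refined = self_train_prototypes(Z_pred, unseen)
    labels_refined = cosine_similarity(Z_pred, refined).argmax(axis=1)
    print(labels[:10], labels_refined[:10])
```

In this reading, the only learned component is the visual-to-semantic regressor trained on seen classes; unseen classes are represented purely by their label word vectors, and the transductive step adapts those prototypes to the test-data distribution without using any test labels.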

MSC:

68T45 Machine vision and scene understanding
68T05 Learning and adaptive systems in artificial intelligence
