
Zero-shot visual recognition via bidirectional latent embedding. (English) Zbl 1458.68256

Summary: Zero-shot learning for visual recognition, e.g., object and action recognition, has recently attracted considerable attention. However, it remains challenging to bridge the semantic gap between visual features and their underlying semantics, and to transfer knowledge to semantic categories unseen during learning. Unlike most existing zero-shot visual recognition methods, we propose a stagewise bidirectional latent embedding framework consisting of two successive learning stages. In the bottom-up stage, a latent embedding space is first created by exploring the topological and labeling information underlying the training data of known classes via a suitable supervised subspace learning algorithm; the latent embeddings of the training data then serve as landmarks that guide the embedding of semantics underlying unseen classes into this learned latent space. In the top-down stage, semantic representations of unseen-class labels in a given label vocabulary are embedded into the same latent space, so as to preserve the semantic relatedness between all classes, via our proposed semi-supervised Sammon mapping guided by the landmarks. The resultant latent embedding space thus allows the label of a test instance to be predicted with a simple nearest-neighbor rule. To evaluate the effectiveness of the proposed framework, we have conducted extensive experiments on four benchmark datasets in object and action recognition, i.e., AwA, CUB-200-2011, UCF101 and HMDB51. The experimental results of comparative studies demonstrate that our proposed approach yields state-of-the-art performance under both inductive and transductive settings.
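The top-down stage described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, toy data, and plain gradient descent on the Sammon stress are illustrative assumptions. It shows the core idea only: seen-class latent positions (from the bottom-up stage) act as fixed landmarks, unseen-class prototypes are placed so that latent distances mimic semantic distances, and a test instance is labeled by the nearest class prototype.

```python
import numpy as np

def sammon_landmark_embed(sem, seen_idx, landmarks, n_iter=300, lr=0.05, seed=0):
    """Embed class prototypes into a shared latent space.

    sem       : (C, s) semantic vectors for all C classes (seen + unseen)
    seen_idx  : indices of seen classes whose latent positions are fixed
    landmarks : (len(seen_idx), d) latent positions of the seen classes,
                e.g. class means from the bottom-up stage
    Returns a (C, d) array: seen rows stay at the landmarks, unseen rows
    are placed by gradient descent on the Sammon stress so that latent
    distances mimic the semantic distances.
    """
    rng = np.random.default_rng(seed)
    C, d = sem.shape[0], landmarks.shape[1]
    # Target distances: pairwise distances between semantic vectors.
    D = np.linalg.norm(sem[:, None, :] - sem[None, :, :], axis=-1)
    np.fill_diagonal(D, 1.0)                     # dummy value; diagonal is masked below
    scale = 1.0 / D[~np.eye(C, dtype=bool)].sum()
    seen = np.zeros(C, dtype=bool)
    seen[seen_idx] = True
    Y = rng.normal(scale=0.1, size=(C, d))
    Y[seen] = landmarks                          # landmarks anchor the space
    for _ in range(n_iter):
        diff = Y[:, None, :] - Y[None, :, :]
        dhat = np.linalg.norm(diff, axis=-1)     # current latent distances
        np.fill_diagonal(dhat, 1.0)
        coef = (dhat - D) / (D * dhat)           # per-pair Sammon stress weighting
        np.fill_diagonal(coef, 0.0)
        grad = 2.0 * scale * (coef[:, :, None] * diff).sum(axis=1)
        Y[~seen] -= lr * grad[~seen]             # only unseen classes move
    return Y

def predict(latent_test, class_latent):
    """Nearest-neighbour rule: assign each test point its closest class."""
    dists = np.linalg.norm(latent_test[:, None, :] - class_latent[None, :, :], axis=-1)
    return dists.argmin(axis=1)
```

On a toy example with three seen classes and one unseen class whose semantic vector lies near a seen class, the unseen prototype is pulled to the corresponding region of the latent space, after which `predict` classifies with a single distance computation.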

MSC:

68T45 Machine vision and scene understanding
68T05 Learning and adaptive systems in artificial intelligence
68T10 Pattern recognition, speech recognition

References:

[1] Akata, Z., Lee, H., & Schiele, B. (2014). Zero-shot learning with structured embeddings. arXiv:1409.8403.
[2] Akata, Z., Perronnin, F., Harchaoui, Z., & Schmid, C. (2013). Label-embedding for attribute-based classification. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 819-826).
[3] Akata, Z., Perronnin, F., Harchaoui, Z., & Schmid, C. (2016). Label-embedding for image classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38, 1425-1438. · doi:10.1109/TPAMI.2015.2487986
[4] Akata, Z., Reed, S., Walter, D., Lee, H., & Schiele, B. (2015). Evaluation of output embeddings for fine-grained image classification. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 2927-2936).
[5] Al-Halah, Z., & Stiefelhagen, R. (2015). How to transfer? Zero-shot object recognition via hierarchical transfer of semantic attributes. In IEEE winter conference on applications of computer vision (WACV) (pp. 837-843). IEEE.
[6] Andreopoulos, A., & Tsotsos, J. K. (2013). 50 years of object recognition: Directions forward. Computer Vision and Image Understanding, 117, 827-891. · doi:10.1016/j.cviu.2013.04.005
[7] Cai, D., He, X., & Han, J. (2007). Semi-supervised discriminant analysis. In International conference on computer vision (pp. 1-7). IEEE.
[8] Changpinyo, S., Chao, W.-L., Gong, B., & Sha, F. (2016a). Synthesized classifiers for zero-shot learning. In IEEE conference on computer vision and pattern recognition (CVPR).
[9] Changpinyo, S., Chao, W.-L., & Sha, F. (2016b). Predicting visual exemplars of unseen classes for zero-shot learning. arXiv:1605.08151.
[10] Chatfield, K., Simonyan, K., Vedaldi, A., & Zisserman, A. (2014). Return of the devil in the details: Delving deep into convolutional nets. In British machine vision conference (BMVC).
[11] Cheng, J., Liu, Q., Lu, H., & Chen, Y.-W. (2005). Supervised kernel locality preserving projections for face recognition. Neurocomputing, 67, 443-449. · doi:10.1016/j.neucom.2004.08.006
[12] Cox, T. F., & Cox, M. A. (2000). Multidimensional scaling. Boca Raton: CRC Press.
[13] Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines and other kernel-based learning methods. Cambridge: Cambridge University Press. · Zbl 0994.68074 · doi:10.1017/CBO9780511801389
[14] Dinu, G., Lazaridou, A., & Baroni, M. (2015). Improving zero-shot learning by mitigating the hubness problem. In International conference on learning representations workshop.
[15] Elhoseiny, M., Elgammal, A., & Saleh, B. (2015). Tell and predict: Kernel classifier prediction for unseen visual classes from unstructured text descriptions. In IEEE conference on computer vision and pattern recognition (CVPR) workshop on language and vision.
[16] Frome, A., Corrado, G. S., Shlens, J., Bengio, S., Dean, J., Mikolov, T. et al. (2013). Devise: A deep visual-semantic embedding model. In Advances in neural information processing systems (pp. 2121-2129).
[17] Fu, Y., Hospedales, T. M., Xiang, T., & Gong, S. (2015). Transductive multi-view zero-shot learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37, 2332-2345. · doi:10.1109/TPAMI.2015.2408354
[18] Fu, Y., & Huang, T. (2010). Manifold and subspace learning for pattern recognition. Pattern Recognition and Machine Vision, 6, 215.
[19] Gan, C., Lin, M., Yang, Y., Zhuang, Y., & Hauptmann, A. G. (2015). Exploring semantic inter-class relationships (SIR) for zero-shot action recognition. In Twenty-ninth AAAI conference on artificial intelligence.
[20] Gan, C., Yang, T., & Gong, B. (2016). Learning attributes equals multi-source domain generalization. In IEEE conference on computer vision and pattern recognition (CVPR).
[21] Gong, Y., Ke, Q., Isard, M., & Lazebnik, S. (2014). A multi-view embedding space for modeling internet images, tags, and their semantics. International Journal of Computer Vision, 106, 210-233. · doi:10.1007/s11263-013-0658-4
[22] Griffin, G., Holub, A., & Perona, P. (2007). Caltech-256 object category dataset. Technical report 7694. California Institute of Technology. http://www.vision.caltech.edu/Image_Datasets/Caltech256/.
[23] Hardoon, D. R., Szedmak, S., & Shawe-Taylor, J. (2004). Canonical correlation analysis: An overview with application to learning methods. Neural Computation, 16, 2639-2664. · Zbl 1062.68134 · doi:10.1162/0899766042321814
[24] He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 770-778).
[25] Jayaraman, D., & Grauman, K. (2014). Zero-shot recognition with unreliable attributes. In Advances in neural information processing systems (pp. 3464-3472).
[26] Jiang, Y.-G., Liu, J., Roshan Zamir, A., Toderici, G., Laptev, I., Shah, M., & Sukthankar, R. (2014). THUMOS challenge: Action recognition with a large number of classes. http://crcv.ucf.edu/THUMOS14/.
[27] Jolliffe, I. (2002). Principal component analysis. Hoboken: Wiley Online Library. · Zbl 1011.62064
[28] Karpathy, A., & Fei-Fei, L. (2015). Deep visual-semantic alignments for generating image descriptions. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 3128-3137).
[29] Kodirov, E., Xiang, T., Fu, Z., & Gong, S. (2015). Unsupervised domain adaptation for zero-shot learning. In IEEE international conference on computer vision (ICCV) (pp. 2452-2460).
[30] Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., & Serre, T. (2011). HMDB: a large video database for human motion recognition. In IEEE international conference on computer vision (ICCV) (pp. 2556-2563). IEEE.
[31] Lampert, C. H., Nickisch, H., & Harmeling, S. (2009). Learning to detect unseen object classes by between-class attribute transfer. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 951-958). IEEE.
[32] Lampert, C. H., Nickisch, H., & Harmeling, S. (2014). Attribute-based classification for zero-shot visual object categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36, 453-465. · doi:10.1109/TPAMI.2013.140
[33] Liu, J., Kuipers, B., & Savarese, S. (2011). Recognizing human actions by attributes. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 3337-3344). IEEE.
[34] Mensink, T., Gavves, E., & Snoek, C. (2014). COSTA: Co-occurrence statistics for zero-shot classification. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 2441-2448).
[35] Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111-3119).
[36] He, X., & Niyogi, P. (2004). Locality preserving projections. In Advances in neural information processing systems (Vol. 16, p. 153). MIT Press.
[37] Norouzi, M., Mikolov, T., Bengio, S., Singer, Y., Shlens, J., Frome, A., Corrado, G. S., & Dean, J. (2014). Zero-shot learning by convex combination of semantic embeddings. In International conference on learning representations (ICLR).
[38] Peng, X., Wang, L., Wang, X., & Qiao, Y. (2016). Bag of visual words and fusion methods for action recognition: Comprehensive study and good practice. Computer Vision and Image Understanding, 150, 109-125. · doi:10.1016/j.cviu.2016.03.013
[39] Radovanović, M., Nanopoulos, A., & Ivanović, M. (2010). Hubs in space: Popular nearest neighbors in high-dimensional data. The Journal of Machine Learning Research, 11, 2487-2531. · Zbl 1242.62056
[40] Reed, S., Akata, Z., Schiele, B., & Lee, H. (2016). Learning deep representations of fine-grained visual descriptions. In IEEE conference on computer vision and pattern recognition (CVPR).
[41] Romera-Paredes, B., & Torr, P. (2015). An embarrassingly simple approach to zero-shot learning. In International conference on machine learning (ICML) (pp. 2152-2161).
[42] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., et al. (2015). Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115, 211-252. · doi:10.1007/s11263-015-0816-y
[43] Sammon, J. W. (1969). A nonlinear mapping for data structure analysis. IEEE Transactions on Computers, 18, 401-409. · doi:10.1109/T-C.1969.222678
[44] Shao, L., Liu, L., & Yu, M. (2016). Kernelized multiview projection for robust action recognition. International Journal of Computer Vision, 118, 115-129. · doi:10.1007/s11263-015-0861-6
[45] Shao, L., Zhen, X., Tao, D., & Li, X. (2014). Spatio-temporal laplacian pyramid coding for action recognition. IEEE Transactions on Cybernetics, 44, 817-827. · doi:10.1109/TCYB.2013.2273174
[46] Shigeto, Y., Suzuki, I., Hara, K., Shimbo, M., & Matsumoto, Y. (2015). Ridge regression, hubness, and zero-shot learning. In Machine learning and knowledge discovery in databases (pp. 135-151). Springer.
[47] Simonyan, K., & Zisserman, A. (2014). Two-stream convolutional networks for action recognition in videos. In Advances in neural information processing systems (pp. 568-576).
[48] Simonyan, K., & Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. In International conference on learning representations.
[49] Smola, A., & Vapnik, V. (1997). Support vector regression machines. Advances in Neural Information Processing Systems, 9, 155-161.
[50] Solmaz, B., Assari, S. M., & Shah, M. (2013). Classifying web videos using a global video descriptor. Machine Vision and Applications, 24, 1473-1485. · doi:10.1007/s00138-012-0449-x
[51] Soomro, K., Zamir, A. R., & Shah, M. (2012). UCF101: A dataset of 101 human actions classes from videos in the wild. In CRCV-TR-12-01.
[52] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., & Rabinovich, A. (2015). Going deeper with convolutions. In IEEE conference on computer vision and pattern recognition (pp. 1-9).
[53] Tran, D., Bourdev, L., Fergus, R., Torresani, L., & Paluri, M. (2015). Learning spatiotemporal features with 3D convolutional networks. In International conference on computer vision (ICCV) (pp. 4489-4497).
[54] Tsochantaridis, I., Joachims, T., Hofmann, T., & Altun, Y. (2005). Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6, 1453-1484. · Zbl 1222.68321
[55] Vedaldi, A., & Lenc, K. (2015). Matconvnet—Convolutional neural networks for matlab. In ACM international conference on multimedia.
[56] Wah, C., Branson, S., Welinder, P., Perona, P., & Belongie, S. (2011). The caltech-ucsd birds-200-2011 dataset. Technical report CNS-TR-2010-001. California Institute of Technology. http://www.vision.caltech.edu/visipedia/CUB-200-2011.html.
[57] Wang, H., & Schmid, C. (2013). Action recognition with improved trajectories. In IEEE international conference on computer vision (ICCV) (pp. 3551-3558). IEEE.
[58] Wang, L., Xiong, Y., Wang, Z., Qiao, Y., & Van Gool, L. (2016). Temporal segment networks: Towards good practices for deep action recognition. In European conference on computer vision (ECCV).
[59] Wu, Z., Jiang, Y.-G., Wang, X., Ye, H., Xue, X., & Wang, J. (2016). Multi-stream multi-class fusion of deep networks for video classification. In ACM multimedia (ACM MM).
[60] Xian, Y., Akata, Z., Sharma, G., Nguyen, Q., Hein, M., & Schiele, B. (2016). Latent embeddings for zero-shot classification. In IEEE conference on computer vision and pattern recognition (CVPR).
[61] Xu, X., Hospedales, T., & Gong, S. (2015a). Semantic embedding space for zero-shot action recognition. In IEEE international conference on image processing (ICIP) (pp. 63-67). IEEE.
[62] Xu, X., Hospedales, T., & Gong, S. (2015b). Zero-shot action recognition by word-vector embedding. arXiv:1511.04458.
[63] Yu, M., Liu, L., & Shao, L. (2015). Kernelized multiview projection. arXiv:1508.00430.
[64] Zhang, H., Deng, W., Guo, J., & Yang, J. (2010). Locality preserving and global discriminant projection with prior information. Machine Vision and Applications, 21, 577-585. · doi:10.1007/s00138-009-0213-z
[65] Zhang, Z., & Saligrama, V. (2015). Zero-shot learning via semantic similarity embedding. In IEEE international conference on computer vision (ICCV) (pp. 4166-4174).
[66] Zhang, Z., & Saligrama, V. (2016a). Zero-shot learning via joint latent similarity embedding. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 6034-6042).
[67] Zhang, Z., & Saligrama, V. (2016b). Zero-shot recognition via structured prediction. In European conference on computer vision (pp. 533-548). Springer.
[68] Zhao, S., Liu, Y., Han, Y., & Hong, R. (2015). Pooling the convolutional layers in deep convnets for action recognition. arXiv:1511.02126.
[69] Zheng, Z., Yang, F., Tan, W., Jia, J., & Yang, J. (2007). Gabor feature-based face recognition using supervised locality preserving projection. Signal Processing, 87, 2473-2483. · Zbl 1186.94401 · doi:10.1016/j.sigpro.2007.03.006
This reference list is based on information provided by the publisher or retrieved from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data-conversion errors. In some cases, these data have been complemented or enhanced by data from zbMATH Open. The list attempts to reflect the references in the original paper as accurately as possible, without claiming completeness or a perfect matching.