
Cross-modal dual subspace learning with adversarial network. (English) Zbl 1468.68172

Summary: Cross-modal retrieval has attracted much interest with the rapid growth of multimodal data; its two key challenges are effectively exploiting the complementary relationships among data of different modalities and reducing the heterogeneity gap as far as possible. In this paper, we present a novel network model termed cross-modal Dual Subspace learning with Adversarial Network (DSAN). The main contributions are as follows: (1) Dual subspaces (a visual subspace and a textual subspace) are proposed, which better mine the underlying structural information of the different modalities as well as modality-specific information. (2) An improved quadruplet loss is proposed, which accounts for both the relative and the absolute distances between positive and negative samples and incorporates hard sample mining. (3) An intra-modal constrained loss is proposed to maximize the distance between the most similar cross-modal negative samples and their corresponding cross-modal positive samples. In particular, feature preservation and modality classification act as two adversaries: DSAN tries to narrow the heterogeneity gap between modalities while distinguishing the original modality of randomly drawn samples in the dual subspaces. Comprehensive experiments show that DSAN significantly outperforms nine state-of-the-art methods on four cross-modal datasets.
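
Contribution (2) above is stated only at a high level; the following PyTorch sketch is a minimal, hedged illustration of what a quadruplet-style loss with batch-hard mining and combined relative/absolute margins could look like. It is not the authors' implementation: the margins m_rel and m_abs, the helper pairwise_distances, and the reading of "absolute distance" as a fixed upper margin on positive pairs are assumptions introduced here purely for illustration.

```python
import torch
import torch.nn.functional as F

def pairwise_distances(img_emb, txt_emb):
    # Euclidean distance between every image and every text embedding (B x B).
    return torch.cdist(img_emb, txt_emb, p=2)

def quadruplet_hard_loss(img_emb, txt_emb, labels, m_rel=0.5, m_abs=0.2):
    # Illustrative sketch only, not the paper's exact loss.
    # For each image anchor, mine the hardest (farthest) cross-modal positive
    # and the hardest (closest) cross-modal negative within the mini-batch.
    d = pairwise_distances(img_emb, txt_emb)               # (B, B)
    same = labels.unsqueeze(1).eq(labels.unsqueeze(0))     # (B, B) bool mask
    d_pos = d.masked_fill(~same, float('-inf')).max(dim=1).values
    d_neg = d.masked_fill(same, float('inf')).min(dim=1).values
    # Relative-distance term: positives must beat the hardest negative by m_rel.
    relative = F.relu(m_rel + d_pos - d_neg)
    # Absolute-distance term: positives must also stay within a fixed margin
    # m_abs (one possible reading of "absolute distance" in the summary).
    absolute = F.relu(d_pos - m_abs)
    return (relative + absolute).mean()

# Toy usage with random, L2-normalised embeddings and three semantic classes.
B, dim = 8, 128
img = F.normalize(torch.randn(B, dim), dim=1)
txt = F.normalize(torch.randn(B, dim), dim=1)
labels = torch.randint(0, 3, (B,))
print(quadruplet_hard_loss(img, txt, labels))
```

The intra-modal constrained loss of contribution (3) and the adversarial interplay between feature preservation and modality classification would sit on top of such a term; they are omitted here because the summary does not give their exact form.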

MSC:

68T05 Learning and adaptive systems in artificial intelligence

Software:

InfoGAN; PL-ranking
Full Text: DOI
