Abstract
This paper compares two fusion methods, early fusion and late fusion. Early fusion is carried out at the kernel level, an approach also known as multiple kernel learning; late fusion combines the modalities at the classifier score level through logistic regression. Two kinds of multilayer fusion structures, which differ in the number of feature/kernel groups in the lower fusion layer, are constructed for the early and late fusion systems, respectively. The goal of these fusion methods is to make effective use of each feature and to mine the redundant information in their combination, and thereby to develop a generic and robust semantic indexing system that bridges the semantic gap between human concepts and low-level visual features. Evaluation on both the TRECVID 2009 and TRECVID 2010 datasets shows that the systems using the proposed multilayer fusion methods at the kernel level reach this goal more stably than classification-score-level fusion; the most effective and robust system, with the highest MAP score, is built by early fusion with two-layer equally weighted composite kernel learning.
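To make the two strategies concrete, the following is a minimal sketch and not the authors' implementation. It assumes Gaussian (RBF) base kernels, three hypothetical feature types (`color`, `texture`, `edge`), synthetic data, and a plain gradient-descent logistic regression standing in for whatever solver the paper used. The function names are illustrative. The early-fusion part builds a two-layer equally weighted composite kernel; the late-fusion part learns weights over per-classifier scores.

```python
import numpy as np

def rbf_kernel(X, Y, gamma):
    """Gaussian (RBF) kernel matrix between the rows of X and the rows of Y."""
    sq = (X ** 2).sum(1)[:, None] + (Y ** 2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def average_kernels(kernels):
    """Equally weighted composite kernel: the mean of the given kernel matrices."""
    return np.mean(np.stack(kernels), axis=0)

def two_layer_kernel(groups):
    """Early fusion: average within each group (lower layer), then across groups."""
    return average_kernels([average_kernels(g) for g in groups])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(S, y, lr=0.5, steps=2000):
    """Late fusion: logistic regression on a score matrix S (samples x classifiers)."""
    w, b = np.zeros(S.shape[1]), 0.0
    for _ in range(steps):
        grad = sigmoid(S @ w + b) - y      # gradient of the log loss w.r.t. the logit
        w -= lr * S.T @ grad / len(y)
        b -= lr * grad.mean()
    return w, b

# Synthetic stand-ins for real descriptors and labels (assumptions, not paper data).
rng = np.random.default_rng(0)
n = 40
y = (np.arange(n) % 2).astype(float)
color = rng.normal(size=(n, 8))
texture = rng.normal(size=(n, 5))
edge = rng.normal(size=(n, 6))

# Early fusion: one kernel per feature, grouped as {color} and {texture, edge}.
K_color = rbf_kernel(color, color, gamma=0.1)
K_texture = rbf_kernel(texture, texture, gamma=0.1)
K_edge = rbf_kernel(edge, edge, gamma=0.1)
K_fused = two_layer_kernel([[K_color], [K_texture, K_edge]])
# K_fused could then be passed to any SVM that accepts a precomputed kernel.

# Late fusion: three hypothetical per-feature classifier scores, fused by logistic regression.
scores = (2.0 * y - 1.0)[:, None] + 0.3 * rng.normal(size=(n, 3))
w, b = fit_logistic(scores, y)
fused = sigmoid(scores @ w + b)
```

With unequal group sizes, the two-layer average weights `K_color` at 1/2 and the other two kernels at 1/4 each, which is the kind of difference in lower-layer grouping that the abstract describes.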
Acknowledgments
This work is sponsored by the collaborative Research Project SEV01100474 between Beijing University of Posts and Telecommunications and France Telecom R&D Beijing, and by National Natural Science Foundation of China grant 90920001.
Cite this article
Dong, Y., Gao, S., Tao, K. et al. Performance evaluation of early and late fusion methods for generic semantics indexing. Pattern Anal Applic 17, 37–50 (2014). https://doi.org/10.1007/s10044-013-0336-8
DOI: https://doi.org/10.1007/s10044-013-0336-8