Abstract
With the ongoing development of the internet, a large number of multimedia documents containing images and text have appeared in people's daily lives. How to conduct cross-modal and multi-modal retrieval effectively and efficiently has therefore become an important issue. Although some methods have been proposed to address this issue, their retrieval processes are confined to a single information source of multimedia documents, such as the representations of images and texts at a semantic level. In this paper, we propose a novel probabilistic model, CCSS, which not only combines low-level content similarity and high-level semantics similarity through a first-order Markov chain, but also provides heterogeneous similarity measures for different unimedia types. The ranked list for a query is obtained by highlighting an optimal path across the chain. Content similarity focuses on the internal structure of each modality, while semantics similarity focuses on the semantic correlation between different modalities. Both are significant, and they complement each other when combined. Multi-class logistic regression and random forests are used to map the original features of each unimedia type into a semantic space. In the query-by-example scenario, experiments on the Wikipedia dataset show that our model significantly outperforms state-of-the-art approaches for cross-modal retrieval. The proposed multi-modal method is also shown to outperform previous systems on the image retrieval task.
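As a rough illustration of the semantic-space idea described above, the sketch below assumes that each image and each text has already been mapped to a vector of class posteriors (for example by a multi-class logistic regression) and ranks texts for an image query by a metric similarity between those vectors. The function names, the toy posterior values, and the choice of cosine similarity are assumptions made for illustration only. This is not the CCSS model itself, which additionally combines content similarity through a first-order Markov chain.

```python
import math


def cosine_similarity(p, q):
    """Cosine similarity between two class-posterior vectors of equal length."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0  # a zero vector carries no semantic information
    return dot / (norm_p * norm_q)


def rank_by_semantic_similarity(query_posterior, candidates):
    """Return candidate ids sorted by decreasing similarity to the query.

    `candidates` maps a hypothetical document id to its posterior vector.
    """
    scored = [
        (cosine_similarity(query_posterior, posterior), doc_id)
        for doc_id, posterior in candidates.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored]


# Hypothetical posteriors over three semantic classes (not taken from the paper).
image_query = [0.7, 0.2, 0.1]
text_candidates = {
    "text_a": [0.1, 0.8, 0.1],
    "text_b": [0.6, 0.3, 0.1],
    "text_c": [0.2, 0.2, 0.6],
}

ranking = rank_by_semantic_similarity(image_query, text_candidates)
print(ranking)
```

Because both modalities are represented in the same semantic space, an image query can be compared directly with text candidates, and the comparison uses the full posterior vectors rather than only the most likely class label.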







Notes
Semantics similarity means metric similarity rather than the similarity among concept labels.
Cite this article
Wang, S., Pan, P., Lu, Y. et al. Improving cross-modal and multi-modal retrieval combining content and semantics similarities with probabilistic model. Multimed Tools Appl 74, 2009–2032 (2015). https://doi.org/10.1007/s11042-013-1737-9
DOI: https://doi.org/10.1007/s11042-013-1737-9