A support vector approach for cross-modal search of images and texts

Published: 01 January 2017

Highlights

We propose a novel and generic approach for cross-modal search based on Structural SVM.
Our approach provides max-margin guarantees and better generalization than competing methods (the standard max-margin formulation is sketched after these highlights).
We analyze different aspects of our approach, such as training and testing time, and compare its performance across datasets.
Extensive experiments demonstrate the efficacy of our approach.
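As context for the max-margin guarantee mentioned above, the following is the standard margin-rescaled Structural SVM training objective that approaches of this kind build on; the joint feature map Ψ, the loss Δ, and the label space used in this paper are specified in the full text, so this is only a generic sketch.

\[
\min_{\mathbf{w},\,\boldsymbol{\xi}\ge 0}\;\; \frac{1}{2}\lVert \mathbf{w}\rVert^2 + \frac{C}{n}\sum_{i=1}^{n}\xi_i
\qquad \text{s.t.}\qquad
\langle \mathbf{w},\Psi(x_i,y_i)\rangle - \langle \mathbf{w},\Psi(x_i,y)\rangle \;\ge\; \Delta(y_i,y) - \xi_i
\quad \forall i,\;\forall y\neq y_i .
\]

Here x_i is a query (an image for Im2Text, a text for Text2Im), y_i its ground-truth counterpart, and Δ penalizes predictions that are semantically far from y_i; the constraints require every correct pairing to score higher than every incorrect one by a loss-dependent margin.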

Abstract

Building bilateral semantic associations between images and texts is among the fundamental problems in computer vision. In this paper, we study two complementary cross-modal prediction tasks: (i) predicting text(s) given a query image (“Im2Text”), and (ii) predicting image(s) given a piece of text (“Text2Im”). We make no assumption about the specific form of the text; it could be a set of labels, phrases, or even captions. We pose both tasks in a retrieval framework. For Im2Text, given a query image, our goal is to retrieve a ranked list of semantically relevant texts from an independent text corpus (i.e., texts with no corresponding images). Similarly, for Text2Im, given a query text, we aim to retrieve a ranked list of semantically relevant images from a collection of unannotated images (i.e., images without any associated textual meta-data).
We propose a novel Structural SVM based unified framework for these two tasks, and show how it can be trained and tested efficiently. Extensive experiments with a variety of loss functions are conducted on three popular datasets (two medium-scale datasets containing a few thousand samples each, and one web-scale dataset containing one million samples). The experiments demonstrate that our framework gives promising results compared to competing baseline cross-modal search techniques, confirming its efficacy.
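To make the retrieval formulation concrete, below is a minimal sketch of how both directions reduce to ranking candidates by a cross-modal compatibility score. It assumes a simple bilinear score x^T W y between an image feature x and a text feature y; the paper's actual joint representation and learned parameters may differ, and all names, dimensions, and the random features here are hypothetical.

```python
import numpy as np

def rank_texts_for_image(img_feat, text_feats, W, k=5):
    """Im2Text: score every candidate text against the query image and
    return indices of the top-k texts (highest compatibility first)."""
    # Bilinear compatibility: s(x, y) = x^T W y.
    scores = img_feat @ W @ text_feats.T          # shape: (num_texts,)
    return np.argsort(-scores)[:k]

def rank_images_for_text(text_feat, img_feats, W, k=5):
    """Text2Im: the symmetric direction -- rank unannotated images for a query text."""
    scores = img_feats @ W @ text_feat            # shape: (num_images,)
    return np.argsort(-scores)[:k]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d_img, d_txt = 128, 64                          # hypothetical feature dimensions
    # W stands in for parameters that would come from max-margin training.
    W = rng.standard_normal((d_img, d_txt))
    img_feats = rng.standard_normal((1000, d_img))  # unannotated image collection
    text_feats = rng.standard_normal((5000, d_txt)) # independent text corpus
    print(rank_texts_for_image(rng.standard_normal(d_img), text_feats, W))
    print(rank_images_for_text(rng.standard_normal(d_txt), img_feats, W))
```

With precomputed features, both directions are a single matrix-vector product followed by a sort, which is why training and testing time are worth comparing across methods, as the highlights note.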

Published In

Computer Vision and Image Understanding, Volume 154, Issue C, Jan 2017, 206 pages

Publisher

Elsevier Science Inc., United States

Author Tags

1. Image search
2. Image description
3. Cross-media analysis
