A support vector approach for cross-modal search of images and texts

Published: 01 January 2017

Highlights

We propose a novel and generic approach for cross-modal search based on Structural SVM.
Our approach provides max-margin guarantees and better generalization than competing methods (the standard max-margin formulation is sketched after these highlights).
We analyze different aspects of our approach, such as training and testing time, and compare its performance across datasets.
Extensive experiments demonstrate the efficacy of our approach.
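As context for the max-margin guarantee mentioned above, the following is the standard margin-rescaled Structural SVM training objective that approaches of this kind build on; the joint feature map Ψ, the loss Δ, and the label space used in this paper are specified in the full text, so this is only a generic sketch.

\[
\min_{\mathbf{w},\,\boldsymbol{\xi}\ge 0}\;\; \frac{1}{2}\lVert \mathbf{w}\rVert^2 + \frac{C}{n}\sum_{i=1}^{n}\xi_i
\qquad \text{s.t.}\qquad
\langle \mathbf{w},\Psi(x_i,y_i)\rangle - \langle \mathbf{w},\Psi(x_i,y)\rangle \;\ge\; \Delta(y_i,y) - \xi_i
\quad \forall i,\;\forall y\neq y_i .
\]

Here x_i is a query (an image for Im2Text, a text for Text2Im), y_i its ground-truth counterpart, and Δ penalizes predictions that are semantically far from y_i; the constraints require every correct pairing to score higher than every incorrect one by a loss-dependent margin.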

Abstract

Building bilateral semantic associations between images and texts is among the fundamental problems in computer vision. In this paper, we study two complementary cross-modal prediction tasks: (i) predicting text(s) given a query image (“Im2Text”), and (ii) predicting image(s) given a piece of text (“Text2Im”). We make no assumption about the specific form of the text; it could be a set of labels, phrases, or even captions. We pose both tasks in a retrieval framework. For Im2Text, given a query image, our goal is to retrieve a ranked list of semantically relevant texts from an independent text corpus (i.e., texts with no corresponding images). Similarly, for Text2Im, given a query text, we aim to retrieve a ranked list of semantically relevant images from a collection of unannotated images (i.e., images without any associated textual meta-data).
We propose a novel Structural SVM based unified framework for these two tasks, and show how it can be trained and tested efficiently. Extensive experiments with a variety of loss functions are conducted on three popular datasets (two medium-scale datasets containing a few thousand samples each, and one web-scale dataset containing one million samples). The experiments demonstrate that our framework gives promising results compared to competing baseline cross-modal search techniques, confirming its efficacy.
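To make the retrieval formulation concrete, below is a minimal sketch of how both directions reduce to ranking candidates by a cross-modal compatibility score. It assumes a simple bilinear score x^T W y between an image feature x and a text feature y; the paper's actual joint representation and learned parameters may differ, and all names, dimensions, and the random features here are hypothetical.

```python
import numpy as np

def rank_texts_for_image(img_feat, text_feats, W, k=5):
    """Im2Text: score every candidate text against the query image and
    return indices of the top-k texts (highest compatibility first)."""
    # Bilinear compatibility: s(x, y) = x^T W y.
    scores = img_feat @ W @ text_feats.T          # shape: (num_texts,)
    return np.argsort(-scores)[:k]

def rank_images_for_text(text_feat, img_feats, W, k=5):
    """Text2Im: the symmetric direction -- rank unannotated images for a query text."""
    scores = img_feats @ W @ text_feat            # shape: (num_images,)
    return np.argsort(-scores)[:k]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d_img, d_txt = 128, 64                          # hypothetical feature dimensions
    # W stands in for parameters that would come from max-margin training.
    W = rng.standard_normal((d_img, d_txt))
    img_feats = rng.standard_normal((1000, d_img))  # unannotated image collection
    text_feats = rng.standard_normal((5000, d_txt)) # independent text corpus
    print(rank_texts_for_image(rng.standard_normal(d_img), text_feats, W))
    print(rank_images_for_text(rng.standard_normal(d_txt), img_feats, W))
```

With precomputed features, both directions are a single matrix-vector product followed by a sort, which is why training and testing time are worth comparing across methods, as the highlights note.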

Published In

Computer Vision and Image Understanding, Volume 154, Issue C, Jan 2017, 206 pages

Publisher

Elsevier Science Inc., United States

Author Tags

1. Image search
2. Image description
3. Cross-media analysis
