DOI: 10.1145/2647868.2654902
Research article

Cross-modal Retrieval with Correspondence Autoencoder

Published: 03 November 2014

Abstract

This paper considers the problem of cross-modal retrieval, e.g., using a text query to search for images and vice versa. A novel model based on a correspondence autoencoder (Corr-AE) is proposed for this problem. The model is constructed by correlating the hidden representations of two uni-modal autoencoders. A novel objective, which minimizes a linear combination of the representation learning error for each modality and the correlation learning error between the hidden representations of the two modalities, is used to train the model as a whole. Minimizing the correlation learning error forces the model to learn hidden representations that capture only the information common to the two modalities, while minimizing the representation learning error keeps the hidden representations good enough to reconstruct the input of each modality. A parameter $\alpha$ balances the representation learning error against the correlation learning error. Based on two different multi-modal autoencoders, Corr-AE is extended to two further correspondence models, called Corr-Cross-AE and Corr-Full-AE. The proposed models are evaluated on three publicly available data sets from real scenes. The three correspondence autoencoders are shown to perform significantly better than three canonical correlation analysis based models and two popular multi-modal deep models on cross-modal retrieval tasks.
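The objective described in the abstract can be illustrated with a minimal NumPy sketch. This is a hypothetical toy version, not the paper's actual architecture: it assumes single-layer sigmoid autoencoders with tied decoder weights, and the feature sizes, weight matrices, and the default value of alpha are placeholders chosen for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def corr_ae_loss(x, y, Wx, Wy, alpha=0.2):
    """Corr-AE-style objective on one (image, text) feature pair.

    x, y   : uni-modal input feature vectors (toy random features here)
    Wx, Wy : encoder weight matrices; their transposes serve as tied decoders
    alpha  : trade-off between reconstruction and correlation terms
    """
    hx = sigmoid(Wx @ x)  # hidden representation, modality 1
    hy = sigmoid(Wy @ y)  # hidden representation, modality 2
    # Representation learning error: each autoencoder must reconstruct its input.
    rec_x = np.sum((x - sigmoid(Wx.T @ hx)) ** 2)
    rec_y = np.sum((y - sigmoid(Wy.T @ hy)) ** 2)
    # Correlation learning error: hidden codes of the two modalities should agree.
    corr = np.sum((hx - hy) ** 2)
    # Linear combination balanced by alpha, trained as a whole.
    return (1.0 - alpha) * (rec_x + rec_y) + alpha * corr

rng = np.random.default_rng(0)
x, y = rng.normal(size=16), rng.normal(size=32)           # toy image/text features
Wx, Wy = 0.1 * rng.normal(size=(8, 16)), 0.1 * rng.normal(size=(8, 32))
print(corr_ae_loss(x, y, Wx, Wy))
```

With alpha near 0 the model reduces to two independent autoencoders; with alpha near 1 it cares only about matching the two hidden codes, which is the trade-off the abstract describes.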


Cited By

View all
  • (2024) Semi-supervised Prototype Semantic Association Learning for Robust Cross-modal Retrieval. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 872-881. DOI: 10.1145/3626772.3657756
  • (2024) Chaos Theory, Advanced Metaheuristic Algorithms and Their Newfangled Deep Learning Architecture Optimization Applications: A Review. Fractals, 32(3). DOI: 10.1142/S0218348X24300010
  • (2024) Hypergraph-Based Multi-Modal Representation for Open-Set 3D Object Retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(4), 2206-2223. DOI: 10.1109/TPAMI.2023.3332768

Published In

MM '14: Proceedings of the 22nd ACM International Conference on Multimedia
November 2014
1310 pages
ISBN: 9781450330633
DOI: 10.1145/2647868

Publisher

Association for Computing Machinery, New York, NY, United States

      Author Tags

      1. autoencoder
      2. cross-modal
      3. deep learning
      4. image and text
      5. retrieval

Conference

MM '14: 2014 ACM Multimedia Conference
November 3-7, 2014
Orlando, Florida, USA

      Acceptance Rates

MM '14 Paper Acceptance Rate: 55 of 286 submissions, 19%
Overall Acceptance Rate: 995 of 4,171 submissions, 24%


