DOI: 10.1145/2647868.2654902
Research article

Cross-modal Retrieval with Correspondence Autoencoder

Published: 03 November 2014

Abstract

This paper considers the problem of cross-modal retrieval, e.g., using a text query to search for images and vice versa. A novel model based on a correspondence autoencoder (Corr-AE) is proposed for this problem. The model is constructed by correlating the hidden representations of two uni-modal autoencoders. A novel objective, which minimizes a linear combination of the representation learning error for each modality and the correlation learning error between the hidden representations of the two modalities, is used to train the model as a whole. Minimizing the correlation learning error forces the model to learn hidden representations that capture only the information common to the two modalities, while minimizing the representation learning error keeps the hidden representations good enough to reconstruct the input of each modality. A parameter $\alpha$ balances the representation learning error against the correlation learning error. Based on two different multi-modal autoencoders, Corr-AE is extended to two further correspondence models, called Corr-Cross-AE and Corr-Full-AE. The proposed models are evaluated on three publicly available data sets from real scenes. The three correspondence autoencoders are shown to perform significantly better than three canonical correlation analysis based models and two popular multi-modal deep models on cross-modal retrieval tasks.
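The objective described in the abstract can be illustrated with a minimal NumPy sketch. This is a hypothetical toy version, not the paper's actual architecture: it assumes single-layer sigmoid autoencoders with tied decoder weights, and the feature sizes, weight matrices, and the default value of alpha are placeholders chosen for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def corr_ae_loss(x, y, Wx, Wy, alpha=0.2):
    """Corr-AE-style objective on one (image, text) feature pair.

    x, y   : uni-modal input feature vectors (toy random features here)
    Wx, Wy : encoder weight matrices; their transposes serve as tied decoders
    alpha  : trade-off between reconstruction and correlation terms
    """
    hx = sigmoid(Wx @ x)  # hidden representation, modality 1
    hy = sigmoid(Wy @ y)  # hidden representation, modality 2
    # Representation learning error: each autoencoder must reconstruct its input.
    rec_x = np.sum((x - sigmoid(Wx.T @ hx)) ** 2)
    rec_y = np.sum((y - sigmoid(Wy.T @ hy)) ** 2)
    # Correlation learning error: hidden codes of the two modalities should agree.
    corr = np.sum((hx - hy) ** 2)
    # Linear combination balanced by alpha, trained as a whole.
    return (1.0 - alpha) * (rec_x + rec_y) + alpha * corr

rng = np.random.default_rng(0)
x, y = rng.normal(size=16), rng.normal(size=32)           # toy image/text features
Wx, Wy = 0.1 * rng.normal(size=(8, 16)), 0.1 * rng.normal(size=(8, 32))
print(corr_ae_loss(x, y, Wx, Wy))
```

With alpha near 0 the model reduces to two independent autoencoders; with alpha near 1 it cares only about matching the two hidden codes, which is the trade-off the abstract describes.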


Cited By

View all
  • (2024) Semi-supervised Prototype Semantic Association Learning for Robust Cross-modal Retrieval. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 872-881. DOI: 10.1145/3626772.3657756
  • (2024) Chaos Theory, Advanced Metaheuristic Algorithms and Their Newfangled Deep Learning Architecture Optimization Applications: A Review. Fractals, 32(3). DOI: 10.1142/S0218348X24300010
  • (2024) Hypergraph-Based Multi-Modal Representation for Open-Set 3D Object Retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(4), 2206-2223. DOI: 10.1109/TPAMI.2023.3332768

Published In

MM '14: Proceedings of the 22nd ACM International Conference on Multimedia
November 2014
1310 pages
ISBN: 9781450330633
DOI: 10.1145/2647868

Publisher

Association for Computing Machinery, New York, NY, United States

      Author Tags

      1. autoencoder
      2. cross-modal
      3. deep learning
      4. image and text
      5. retrieval

Conference

MM '14: 2014 ACM Multimedia Conference
November 3-7, 2014
Orlando, Florida, USA

      Acceptance Rates

MM '14 Paper Acceptance Rate: 55 of 286 submissions, 19%
Overall Acceptance Rate: 995 of 4,171 submissions, 24%


