DOI: 10.1145/2983563.2983570

Cross-modal Classification by Completing Unimodal Representations

Published: 16 October 2016

Abstract

We argue that cross-modal classification, where models are trained on data from one modality (e.g. text) and applied to data from another (e.g. image), is a relevant problem in multimedia retrieval. We propose a method that addresses this specific problem, related to but different from cross-modal retrieval and bimodal classification. This method relies on a common latent space where both modalities have comparable representations and on an auxiliary dataset from which we build a more complete bimodal representation of any unimodal data. Evaluations on Pascal VOC07 and NUS-WIDE show that the novel representation method significantly improves the results compared to the use of a latent space alone. The level of performance achieved makes cross-modal classification a convincing choice for real applications.


Cited By

  • (2023) Learning semantic ambiguities for zero-shot learning. Multimedia Tools and Applications 82(26), 40745–40759. DOI: 10.1007/s11042-023-14877-1. Online publication date: 31-Mar-2023.
  • (2019) A cross-modal multimedia retrieval method using depth correlation mining in big data environment. Multimedia Tools and Applications. DOI: 10.1007/s11042-019-08238-0. Online publication date: 30-Oct-2019.
  • (2017) AMECON. Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval, 347–355. DOI: 10.1145/3078971.3078993. Online publication date: 6-Jun-2017.


Published In

iV&L-MM '16: Proceedings of the 2016 ACM Workshop on Vision and Language Integration Meets Multimedia Fusion
October 2016
70 pages
ISBN: 9781450345194
DOI: 10.1145/2983563
© 2016 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. kernel canonical correlation analysis
  2. latent representations
  3. unimodal data completion

Qualifiers

  • Research-article

Funding Sources

  • USEMP FP7

Conference

MM '16
Sponsor:
MM '16: ACM Multimedia Conference
October 16, 2016
Amsterdam, The Netherlands

Acceptance Rates

iV&L-MM '16 paper acceptance rate (also the overall acceptance rate): 7 of 15 submissions (47%)


