DOI: 10.1145/2983563.2983570

Cross-modal Classification by Completing Unimodal Representations

Published: 16 October 2016

Abstract

We argue that cross-modal classification, where models are trained on data from one modality (e.g. text) and applied to data from another (e.g. image), is a relevant problem in multimedia retrieval. We propose a method that addresses this specific problem, related to but different from cross-modal retrieval and bimodal classification. This method relies on a common latent space where both modalities have comparable representations and on an auxiliary dataset from which we build a more complete bimodal representation of any unimodal data. Evaluations on Pascal VOC07 and NUS-WIDE show that the novel representation method significantly improves the results compared to the use of a latent space alone. The level of performance achieved makes cross-modal classification a convincing choice for real applications.


Cited By

  • (2023) Learning semantic ambiguities for zero-shot learning. Multimedia Tools and Applications 82(26), 40745–40759. DOI: 10.1007/s11042-023-14877-1. Online publication date: 31-Mar-2023.
  • (2019) A cross-modal multimedia retrieval method using depth correlation mining in big data environment. Multimedia Tools and Applications. DOI: 10.1007/s11042-019-08238-0. Online publication date: 30-Oct-2019.
  • (2017) AMECON. Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval, 347–355. DOI: 10.1145/3078971.3078993. Online publication date: 6-Jun-2017.


Published In

iV&L-MM '16: Proceedings of the 2016 ACM Workshop on Vision and Language Integration Meets Multimedia Fusion
October 2016
70 pages
ISBN: 9781450345194
DOI: 10.1145/2983563
© 2016 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. kernel canonical correlation analysis
  2. latent representations
  3. unimodal data completion

Qualifiers

  • Research-article

Funding Sources

  • USEMP FP7

Conference

MM '16
Sponsor:
MM '16: ACM Multimedia Conference
October 16, 2016
Amsterdam, The Netherlands

Acceptance Rates

iV&L-MM '16 paper acceptance rate (also the overall acceptance rate): 7 of 15 submissions (47%)


