Multimodal Convolutional Neural Networks for Matching Image and Sentence

Ma, Lin; Lu, Zhengdong; Shang, Lifeng; Li, Hang

arXiv:1504.06063 (cs)

[Submitted on 23 Apr 2015 (v1), last revised 29 Aug 2015 (this version, v5)]

Abstract: In this paper, we propose multimodal convolutional neural networks (m-CNNs) for matching image and sentence. Our m-CNN provides an end-to-end framework with convolutional architectures to exploit image representation, word composition, and the matching relations between the two modalities. More specifically, it consists of one image CNN encoding the image content and one matching CNN learning the joint representation of image and sentence. The matching CNN composes words into different semantic fragments and learns the inter-modal relations between the image and the composed fragments at different levels, thus fully exploiting the matching relations between image and sentence. Experimental results on benchmark databases for bidirectional image and sentence retrieval demonstrate that the proposed m-CNNs can effectively capture the information necessary for image and sentence matching. Specifically, the proposed m-CNNs achieve state-of-the-art performance for bidirectional image and sentence retrieval on the Flickr30K and Microsoft COCO databases.
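The matching idea described in the abstract can be illustrated with a toy sketch. This is not the authors' architecture or code: it assumes a single convolutional layer in which each window of consecutive word vectors is concatenated with the image vector, passed through a few filters with ReLU, max-pooled over positions, and reduced to one score. The names `match_score` and `rank_desc`, the `window` parameter, and all weights are hypothetical, and nothing here is learned.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def match_score(
    image_vec: Vector,
    word_vecs: Sequence[Vector],
    filters: Sequence[Vector],
    out_weights: Vector,
    window: int = 2,
) -> float:
    """Toy image-sentence matching score (illustrative assumption only).

    Each filter must have length window * word_dim + len(image_vec).
    """
    dim = len(word_vecs[0])
    # Zero-pad short sentences so that at least one window exists.
    padded = [list(w) for w in word_vecs]
    while len(padded) < window:
        padded.append([0.0] * dim)

    # Starting the running maximum at 0.0 makes it equivalent to
    # ReLU followed by max pooling over all window positions.
    pooled = [0.0] * len(filters)
    for start in range(len(padded) - window + 1):
        # Compose a fragment from `window` consecutive words ...
        fragment = [x for w in padded[start:start + window] for x in w]
        # ... and let it interact with the image representation.
        joint = fragment + list(image_vec)
        for k, f in enumerate(filters):
            pooled[k] = max(pooled[k], dot(f, joint))

    # Hypothetical linear output layer producing the final score.
    return dot(out_weights, pooled)


def rank_desc(scores: Sequence[float]) -> List[int]:
    """Indices sorted by descending score (ties keep input order)."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])
```

Bidirectional retrieval then amounts to ranking with the same score in both directions: for an image query, call `match_score` against every candidate sentence and pass the scores to `rank_desc`; for a sentence query, score it against every candidate image instead.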

Comments:	Accepted by ICCV 2015
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE)
Cite as:	arXiv:1504.06063 [cs.CV]
	(or arXiv:1504.06063v5 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.1504.06063

Submission history

From: Lin Ma
[v1] Thu, 23 Apr 2015 07:10:13 UTC (864 KB)
[v2] Tue, 28 Apr 2015 01:47:05 UTC (864 KB)
[v3] Tue, 7 Jul 2015 08:09:54 UTC (896 KB)
[v4] Tue, 21 Jul 2015 07:30:53 UTC (899 KB)
[v5] Sat, 29 Aug 2015 09:35:09 UTC (899 KB)
