MSRC: Multimodal Spatial Regression with Semantic Context for Phrase Grounding

Published: 06 June 2017

Abstract

Given an image and a natural language query phrase, a grounding system localizes the mentioned objects in the image according to the query's specifications. State-of-the-art methods address the problem by ranking a set of proposal bounding boxes according to the query's semantics, which makes them dependent on the performance of the proposal generation system. Moreover, query phrases drawn from the same sentence are often semantically related and can provide useful cues for grounding objects. We propose a novel Multimodal Spatial Regression with semantic Context (MSRC) system which not only predicts the ground-truth location from proposal bounding boxes, but also refines its predictions by penalizing similarities among different queries coming from the same sentence. The advantages of MSRC are twofold: first, its spatial regression network removes the performance ceiling imposed by proposal generation algorithms. Second, MSRC not only encodes the semantics of a query phrase, but also models its relation to other queries in the same sentence (i.e., its context) via a context refinement network. Experiments show that MSRC provides a significant improvement in accuracy on two popular datasets, Flickr30K Entities and Refer-it Game, with increases of 6.64% and 5.28% over the state of the art, respectively.
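The spatial regression idea can be illustrated with the standard R-CNN box parameterization, where a network predicts offsets that shift and rescale a proposal box toward the ground truth. This is a minimal sketch for intuition only; the actual MSRC network architecture and offset definitions are described in the paper and may differ.

```python
import math

def apply_regression(proposal, offsets):
    """Apply predicted spatial-regression offsets (tx, ty, tw, th) to a
    proposal box (x, y, w, h) in the standard R-CNN parameterization.
    In a grounding system, the offsets would be predicted by a network
    conditioned on both the image region and the query phrase."""
    x, y, w, h = proposal
    tx, ty, tw, th = offsets
    cx = x + 0.5 * w + tx * w   # shift the box center by a fraction of its size
    cy = y + 0.5 * h + ty * h
    nw = w * math.exp(tw)       # rescale width and height in log space
    nh = h * math.exp(th)
    return (cx - 0.5 * nw, cy - 0.5 * nh, nw, nh)

# Zero offsets leave the proposal box unchanged.
print(apply_regression((10, 20, 100, 50), (0.0, 0.0, 0.0, 0.0)))
```

Because the regressed box can move and resize freely, the final localization is no longer restricted to the exact boxes that the proposal generator happened to emit, which is the limitation the abstract refers to.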




Published In

ICMR '17: Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval
June 2017
524 pages
ISBN:9781450347013
DOI:10.1145/3078971
  • General Chairs:
  • Bogdan Ionescu,
  • Nicu Sebe,
  • Program Chairs:
  • Jiashi Feng,
  • Martha Larson,
  • Rainer Lienhart,
  • Cees Snoek
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. context
  2. multimodal
  3. phrase grounding
  4. spatial regression

Qualifiers

  • Research-article

Funding Sources

  • Defense Advanced Research Projects Agency
  • Air Force Research Laboratory

Conference

ICMR '17
Acceptance Rates

ICMR '17 Paper Acceptance Rate: 33 of 95 submissions, 35%
Overall Acceptance Rate: 254 of 830 submissions, 31%


Article Metrics

  • Downloads (Last 12 months)49
  • Downloads (Last 6 weeks)7
Reflects downloads up to 18 Aug 2024


Cited By

  • (2024) MFVG: A Visual Grounding Network with Multi-scale Fusion. Proceedings of the 2024 International Conference on Multimedia Retrieval, 713-721. DOI: 10.1145/3652583.3658002. Online publication date: 30-May-2024
  • (2023) Weakly-Supervised Video Object Grounding via Learning Uni-Modal Associations. IEEE Transactions on Multimedia, 25, 6329-6340. DOI: 10.1109/TMM.2022.3207581. Online publication date: 2023
  • (2023) One for all: One-stage referring expression comprehension with dynamic reasoning. Neurocomputing, 518, 523-532. DOI: 10.1016/j.neucom.2022.10.022. Online publication date: Jan-2023
  • (2022) Skimming, Locating, then Perusing: A Human-Like Framework for Natural Language Video Localization. Proceedings of the 30th ACM International Conference on Multimedia, 4536-4545. DOI: 10.1145/3503161.3547782. Online publication date: 10-Oct-2022
  • (2022) Weakly-Supervised Video Object Grounding via Causal Intervention. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1-1. DOI: 10.1109/TPAMI.2022.3180025. Online publication date: 2022
  • (2022) Revisiting Image-Language Networks for Open-Ended Phrase Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44:4, 2155-2167. DOI: 10.1109/TPAMI.2020.3029008. Online publication date: 1-Apr-2022
  • (2022) Progressive Language-Customized Visual Feature Learning for One-Stage Visual Grounding. IEEE Transactions on Image Processing, 31, 4266-4277. DOI: 10.1109/TIP.2022.3181516. Online publication date: 2022
  • (2021) Virtual Reality in Object Location. Latin American Women and Research Contributions to the IT Field, 307-324. DOI: 10.4018/978-1-7998-7552-9.ch014. Online publication date: 2021
  • (2021) One-Stage Visual Grounding via Semantic-Aware Feature Filter. Proceedings of the 29th ACM International Conference on Multimedia, 1702-1711. DOI: 10.1145/3474085.3475313. Online publication date: 17-Oct-2021
  • (2021) Weakly-Supervised Video Object Grounding via Stable Context Learning. Proceedings of the 29th ACM International Conference on Multimedia, 760-768. DOI: 10.1145/3474085.3475245. Online publication date: 17-Oct-2021
