MSRC: Multimodal Spatial Regression with Semantic Context for Phrase Grounding

Published: 06 June 2017

Abstract

Given an image and a natural language query phrase, a grounding system localizes the mentioned objects in the image according to the query's specifications. State-of-the-art methods address the problem by ranking a set of proposal bounding boxes according to the query's semantics, which makes them dependent on the performance of the proposal generation system. Moreover, query phrases drawn from the same sentence are often semantically related and can provide useful cues for grounding objects. We propose a novel Multimodal Spatial Regression with semantic Context (MSRC) system which not only predicts the ground-truth location from proposal bounding boxes, but also refines its predictions by penalizing similarities among different queries coming from the same sentence. The advantages of MSRC are twofold: first, its spatial regression network removes the performance ceiling imposed by proposal generation algorithms. Second, MSRC not only encodes the semantics of a query phrase, but also models its relation to other queries in the same sentence (i.e., its context) via a context refinement network. Experiments show that MSRC provides a significant improvement in accuracy on two popular datasets, Flickr30K Entities and Refer-it Game, with increases of 6.64% and 5.28% over the state of the art, respectively.
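The spatial regression idea can be illustrated with the standard R-CNN box parameterization, where a network predicts offsets that shift and rescale a proposal box toward the ground truth. This is a minimal sketch for intuition only; the actual MSRC network architecture and offset definitions are described in the paper and may differ.

```python
import math

def apply_regression(proposal, offsets):
    """Apply predicted spatial-regression offsets (tx, ty, tw, th) to a
    proposal box (x, y, w, h) in the standard R-CNN parameterization.
    In a grounding system, the offsets would be predicted by a network
    conditioned on both the image region and the query phrase."""
    x, y, w, h = proposal
    tx, ty, tw, th = offsets
    cx = x + 0.5 * w + tx * w   # shift the box center by a fraction of its size
    cy = y + 0.5 * h + ty * h
    nw = w * math.exp(tw)       # rescale width and height in log space
    nh = h * math.exp(th)
    return (cx - 0.5 * nw, cy - 0.5 * nh, nw, nh)

# Zero offsets leave the proposal box unchanged.
print(apply_regression((10, 20, 100, 50), (0.0, 0.0, 0.0, 0.0)))
```

Because the regressed box can move and resize freely, the final localization is no longer restricted to the exact boxes that the proposal generator happened to emit, which is the limitation the abstract refers to.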




Published In

ICMR '17: Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval
June 2017
524 pages
ISBN:9781450347013
DOI:10.1145/3078971
  • General Chairs:
  • Bogdan Ionescu,
  • Nicu Sebe,
  • Program Chairs:
  • Jiashi Feng,
  • Martha Larson,
  • Rainer Lienhart,
  • Cees Snoek
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. context
  2. multimodal
  3. phrase grounding
  4. spatial regression

Qualifiers

  • Research-article

Funding Sources

  • Defense Advanced Research Projects Agency
  • Air Force Research Laboratory

Conference

ICMR '17
Acceptance Rates

ICMR '17 Paper Acceptance Rate: 33 of 95 submissions, 35%
Overall Acceptance Rate: 254 of 830 submissions, 31%


Article Metrics

  • Downloads (Last 12 months)49
  • Downloads (Last 6 weeks)7
Reflects downloads up to 18 Aug 2024


Cited By

  • (2024) MFVG: A Visual Grounding Network with Multi-scale Fusion. Proceedings of the 2024 International Conference on Multimedia Retrieval, 713-721. DOI: 10.1145/3652583.3658002. Online publication date: 30-May-2024
  • (2023) Weakly-Supervised Video Object Grounding via Learning Uni-Modal Associations. IEEE Transactions on Multimedia, 25, 6329-6340. DOI: 10.1109/TMM.2022.3207581. Online publication date: 2023
  • (2023) One for all: One-stage referring expression comprehension with dynamic reasoning. Neurocomputing, 518, 523-532. DOI: 10.1016/j.neucom.2022.10.022. Online publication date: Jan-2023
  • (2022) Skimming, Locating, then Perusing: A Human-Like Framework for Natural Language Video Localization. Proceedings of the 30th ACM International Conference on Multimedia, 4536-4545. DOI: 10.1145/3503161.3547782. Online publication date: 10-Oct-2022
  • (2022) Weakly-Supervised Video Object Grounding via Causal Intervention. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1-1. DOI: 10.1109/TPAMI.2022.3180025. Online publication date: 2022
  • (2022) Revisiting Image-Language Networks for Open-Ended Phrase Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44:4, 2155-2167. DOI: 10.1109/TPAMI.2020.3029008. Online publication date: 1-Apr-2022
  • (2022) Progressive Language-Customized Visual Feature Learning for One-Stage Visual Grounding. IEEE Transactions on Image Processing, 31, 4266-4277. DOI: 10.1109/TIP.2022.3181516. Online publication date: 2022
  • (2021) Virtual Reality in Object Location. Latin American Women and Research Contributions to the IT Field, 307-324. DOI: 10.4018/978-1-7998-7552-9.ch014. Online publication date: 2021
  • (2021) One-Stage Visual Grounding via Semantic-Aware Feature Filter. Proceedings of the 29th ACM International Conference on Multimedia, 1702-1711. DOI: 10.1145/3474085.3475313. Online publication date: 17-Oct-2021
  • (2021) Weakly-Supervised Video Object Grounding via Stable Context Learning. Proceedings of the 29th ACM International Conference on Multimedia, 760-768. DOI: 10.1145/3474085.3475245. Online publication date: 17-Oct-2021
