DOI: 10.1145/3503161.3548228

Learning from Different text-image Pairs: A Relation-enhanced Graph Convolutional Network for Multimodal NER

Published: 10 October 2022

Abstract

Multimodal Named Entity Recognition (MNER) aims to locate and classify named entities mentioned in a (text, image) pair. However, dominant prior work independently models the internal matching relations within each (text, image) pair, ignoring the external matching relations between different (text, image) pairs in the dataset, even though such relations are crucial for alleviating image noise in the MNER task. In this paper, we primarily explore two kinds of external matching relations between different (text, image) pairs, i.e., inter-modal relations and intra-modal relations. On this basis, we propose a Relation-enhanced Graph Convolutional Network (R-GCN) for the MNER task. Specifically, we first construct an inter-modal relation graph and an intra-modal relation graph to gather, from the dataset, the image information most relevant to the current text and to the current image, respectively. Multimodal interaction and fusion are then leveraged to predict the NER label sequences. Extensive experimental results show that our model consistently outperforms state-of-the-art works on two public datasets. Our code and datasets are available at https://github.com/1429904852/R-GCN.
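
As an illustration only, the following is a minimal, hypothetical PyTorch sketch of the idea described in the abstract: build an inter-modal graph (text-to-image similarity across pairs) and an intra-modal graph (image-to-image similarity), run a graph convolution over each to gather related image information from the dataset, and fuse the result with the text representation before emitting NER label scores. All names, dimensions, the top-k similarity graph construction, and the linear classifier (in place of a CRF decoder) are assumptions for exposition, not the authors' released implementation; see the GitHub repository above for the actual code.

# Hypothetical sketch of a relation-enhanced fusion step (not the authors' implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F


def normalized_adjacency(sim: torch.Tensor, top_k: int = 5) -> torch.Tensor:
    """Build a row-normalized adjacency by keeping each node's top-k most similar neighbors."""
    n = sim.size(0)
    topk = sim.topk(min(top_k, n), dim=-1).indices
    adj = torch.zeros_like(sim).scatter_(1, topk, 1.0)
    adj = adj + torch.eye(n)                      # add self-loops
    return adj / adj.sum(dim=-1, keepdim=True)    # row-normalize


class GCNLayer(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, adj: torch.Tensor, feats: torch.Tensor) -> torch.Tensor:
        # Standard graph convolution: aggregate neighbor features, then transform.
        return F.relu(self.linear(adj @ feats))


class RelationEnhancedFusion(nn.Module):
    """Fuse text features with image information gathered from two relation graphs."""

    def __init__(self, dim: int, num_labels: int):
        super().__init__()
        self.inter_gcn = GCNLayer(dim)   # graph from text-to-image similarity (inter-modal)
        self.intra_gcn = GCNLayer(dim)   # graph from image-to-image similarity (intra-modal)
        self.classifier = nn.Linear(3 * dim, num_labels)  # a CRF decoder could replace this

    def forward(self, text_feats: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # Inter-modal graph: which images in the dataset match the current text?
        inter_adj = normalized_adjacency(text_feats @ image_feats.t())
        # Intra-modal graph: which images are visually similar to the current image?
        intra_adj = normalized_adjacency(image_feats @ image_feats.t())
        inter_ctx = self.inter_gcn(inter_adj, image_feats)
        intra_ctx = self.intra_gcn(intra_adj, image_feats)
        fused = torch.cat([text_feats, inter_ctx, intra_ctx], dim=-1)
        return self.classifier(fused)    # per-pair emission scores over NER labels


if __name__ == "__main__":
    n_pairs, dim, num_labels = 8, 16, 9
    model = RelationEnhancedFusion(dim, num_labels)
    text = torch.randn(n_pairs, dim)     # e.g., pooled BERT sentence features (assumed)
    image = torch.randn(n_pairs, dim)    # e.g., pooled ResNet image features (assumed)
    print(model(text, image).shape)      # torch.Size([8, 9])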

Supplementary Material

MP4 File (MM22-fp1971.mp4)
In this paper, we propose a novel Relation-enhanced Graph Convolutional Network (R-GCN) for the MNER task. The main idea of our approach is to leverage two kinds of external matching relations between different (text, image) pairs to improve the model's ability to identify named entities in the text. Results from extensive experiments indicate that our model achieves better performance than other state-of-the-art methods.




Published In

MM '22: Proceedings of the 30th ACM International Conference on Multimedia
October 2022
7537 pages
ISBN:9781450392037
DOI:10.1145/3503161
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. conditional random field
  2. graph convolutional network
  3. multi-head attention
  4. multimodal named entity recognition

Qualifiers

  • Research-article


Conference

MM '22

Acceptance Rates

Overall Acceptance Rate 995 of 4,171 submissions, 24%





Article Metrics

  • Downloads (last 12 months): 284
  • Downloads (last 6 weeks): 15
Reflects downloads up to 30 Aug 2024

Cited By

  • (2024) Visual Clue Guidance and Consistency Matching Framework for Multimodal Named Entity Recognition. Applied Sciences, 14(6), 2333. DOI: 10.3390/app14062333. Online publication date: 10-Mar-2024.
  • (2024) GDN-CMCF: A Gated Disentangled Network With Cross-Modality Consensus Fusion for Multimodal Named Entity Recognition. IEEE Transactions on Computational Social Systems, 11(3), 3944-3954. DOI: 10.1109/TCSS.2023.3323402. Online publication date: Jun-2024.
  • (2024) Dynamic Graph Construction Framework for Multimodal Named Entity Recognition in Social Media. IEEE Transactions on Computational Social Systems, 11(2), 2513-2522. DOI: 10.1109/TCSS.2023.3303027. Online publication date: Apr-2024.
  • (2024) Text-Image Scene Graph Fusion for Multimodal Named Entity Recognition. IEEE Transactions on Artificial Intelligence, 5(6), 2828-2839. DOI: 10.1109/TAI.2023.3326416. Online publication date: Jun-2024.
  • (2024) Chinese Multimodal Named Entity Recognition in Conversational Scenarios. 2024 5th International Seminar on Artificial Intelligence, Networking and Information Technology (AINIT), 1596-1601. DOI: 10.1109/AINIT61980.2024.10581703. Online publication date: 29-Mar-2024.
  • (2024) DGHC: A Hybrid Algorithm for Multi-Modal Named Entity Recognition Using Dynamic Gating and Correlation Coefficients With Visual Enhancements. IEEE Access, 12, 69151-69162. DOI: 10.1109/ACCESS.2024.3400250. Online publication date: 2024.
  • (2024) GNN-Based Multimodal Named Entity Recognition. The Computer Journal, 67(8), 2622-2632. DOI: 10.1093/comjnl/bxae030. Online publication date: 6-Apr-2024.
  • (2024) Evolving to multi-modal knowledge graphs for engineering design: state-of-the-art and future challenges. Journal of Engineering Design, 1-40. DOI: 10.1080/09544828.2023.2301230. Online publication date: 6-Jan-2024.
  • (2024) Multimodal heterogeneous graph entity-level fusion for named entity recognition with multi-granularity visual guidance. The Journal of Supercomputing. DOI: 10.1007/s11227-024-06347-8. Online publication date: 22-Jul-2024.
  • (2024) MVPN: Multi-granularity visual prompt-guided fusion network for multimodal named entity recognition. Multimedia Tools and Applications, 83(28), 71639-71663. DOI: 10.1007/s11042-024-18472-w. Online publication date: 8-Feb-2024.
