DOI: 10.1145/3503161.3548228

Learning from Different text-image Pairs: A Relation-enhanced Graph Convolutional Network for Multimodal NER

Published: 10 October 2022

Abstract

Multimodal Named Entity Recognition (MNER) aims to locate and classify named entities mentioned in a (text, image) pair. However, dominant prior work independently models the internal matching relations within each (text, image) pair, ignoring the external matching relations between different (text, image) pairs in the dataset, even though such relations are crucial for alleviating image noise in the MNER task. In this paper, we primarily explore two kinds of external matching relations between different (text, image) pairs, i.e., inter-modal relations and intra-modal relations. On this basis, we propose a Relation-enhanced Graph Convolutional Network (R-GCN) for the MNER task. Specifically, we first construct an inter-modal relation graph and an intra-modal relation graph to gather, from the dataset, the image information most relevant to the current text and to the current image, respectively. Multimodal interaction and fusion are then leveraged to predict the NER label sequences. Extensive experimental results show that our model consistently outperforms state-of-the-art works on two public datasets. Our code and datasets are available at https://github.com/1429904852/R-GCN.
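
As an illustration only, the following is a minimal, hypothetical PyTorch sketch of the idea described in the abstract: build an inter-modal graph (text-to-image similarity across pairs) and an intra-modal graph (image-to-image similarity), run a graph convolution over each to gather related image information from the dataset, and fuse the result with the text representation before emitting NER label scores. All names, dimensions, the top-k similarity graph construction, and the linear classifier (in place of a CRF decoder) are assumptions for exposition, not the authors' released implementation; see the GitHub repository above for the actual code.

# Hypothetical sketch of a relation-enhanced fusion step (not the authors' implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F


def normalized_adjacency(sim: torch.Tensor, top_k: int = 5) -> torch.Tensor:
    """Build a row-normalized adjacency by keeping each node's top-k most similar neighbors."""
    n = sim.size(0)
    topk = sim.topk(min(top_k, n), dim=-1).indices
    adj = torch.zeros_like(sim).scatter_(1, topk, 1.0)
    adj = adj + torch.eye(n)                      # add self-loops
    return adj / adj.sum(dim=-1, keepdim=True)    # row-normalize


class GCNLayer(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, adj: torch.Tensor, feats: torch.Tensor) -> torch.Tensor:
        # Standard graph convolution: aggregate neighbor features, then transform.
        return F.relu(self.linear(adj @ feats))


class RelationEnhancedFusion(nn.Module):
    """Fuse text features with image information gathered from two relation graphs."""

    def __init__(self, dim: int, num_labels: int):
        super().__init__()
        self.inter_gcn = GCNLayer(dim)   # graph from text-to-image similarity (inter-modal)
        self.intra_gcn = GCNLayer(dim)   # graph from image-to-image similarity (intra-modal)
        self.classifier = nn.Linear(3 * dim, num_labels)  # a CRF decoder could replace this

    def forward(self, text_feats: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # Inter-modal graph: which images in the dataset match the current text?
        inter_adj = normalized_adjacency(text_feats @ image_feats.t())
        # Intra-modal graph: which images are visually similar to the current image?
        intra_adj = normalized_adjacency(image_feats @ image_feats.t())
        inter_ctx = self.inter_gcn(inter_adj, image_feats)
        intra_ctx = self.intra_gcn(intra_adj, image_feats)
        fused = torch.cat([text_feats, inter_ctx, intra_ctx], dim=-1)
        return self.classifier(fused)    # per-pair emission scores over NER labels


if __name__ == "__main__":
    n_pairs, dim, num_labels = 8, 16, 9
    model = RelationEnhancedFusion(dim, num_labels)
    text = torch.randn(n_pairs, dim)     # e.g., pooled BERT sentence features (assumed)
    image = torch.randn(n_pairs, dim)    # e.g., pooled ResNet image features (assumed)
    print(model(text, image).shape)      # torch.Size([8, 9])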

Supplementary Material

MP4 File (MM22-fp1971.mp4)
In this paper, we propose a novel Relation-enhanced Graph Convolutional Network (R-GCN) for the MNER task. The main idea of our approach is to leverage two kinds of external matching relations between different (text, image) pairs to improve the model's ability to identify named entities in the text. Results from extensive experiments indicate that our model achieves better performance than other state-of-the-art methods.




Published In

MM '22: Proceedings of the 30th ACM International Conference on Multimedia
October 2022
7537 pages
ISBN:9781450392037
DOI:10.1145/3503161
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. conditional random field
  2. graph convolutional network
  3. multi-head attention
  4. multimodal named entity recognition

Qualifiers

  • Research-article


Conference

MM '22

Acceptance Rates

Overall Acceptance Rate 995 of 4,171 submissions, 24%





Article Metrics

  • Downloads (last 12 months): 284
  • Downloads (last 6 weeks): 15
Reflects downloads up to 30 Aug 2024

Cited By

  • (2024) Visual Clue Guidance and Consistency Matching Framework for Multimodal Named Entity Recognition. Applied Sciences, 14(6), 2333. DOI: 10.3390/app14062333. Online publication date: 10-Mar-2024.
  • (2024) GDN-CMCF: A Gated Disentangled Network With Cross-Modality Consensus Fusion for Multimodal Named Entity Recognition. IEEE Transactions on Computational Social Systems, 11(3), 3944-3954. DOI: 10.1109/TCSS.2023.3323402. Online publication date: Jun-2024.
  • (2024) Dynamic Graph Construction Framework for Multimodal Named Entity Recognition in Social Media. IEEE Transactions on Computational Social Systems, 11(2), 2513-2522. DOI: 10.1109/TCSS.2023.3303027. Online publication date: Apr-2024.
  • (2024) Text-Image Scene Graph Fusion for Multimodal Named Entity Recognition. IEEE Transactions on Artificial Intelligence, 5(6), 2828-2839. DOI: 10.1109/TAI.2023.3326416. Online publication date: Jun-2024.
  • (2024) Chinese Multimodal Named Entity Recognition in Conversational Scenarios. 2024 5th International Seminar on Artificial Intelligence, Networking and Information Technology (AINIT), 1596-1601. DOI: 10.1109/AINIT61980.2024.10581703. Online publication date: 29-Mar-2024.
  • (2024) DGHC: A Hybrid Algorithm for Multi-Modal Named Entity Recognition Using Dynamic Gating and Correlation Coefficients With Visual Enhancements. IEEE Access, 12, 69151-69162. DOI: 10.1109/ACCESS.2024.3400250. Online publication date: 2024.
  • (2024) GNN-Based Multimodal Named Entity Recognition. The Computer Journal, 67(8), 2622-2632. DOI: 10.1093/comjnl/bxae030. Online publication date: 6-Apr-2024.
  • (2024) Evolving to multi-modal knowledge graphs for engineering design: state-of-the-art and future challenges. Journal of Engineering Design, 1-40. DOI: 10.1080/09544828.2023.2301230. Online publication date: 6-Jan-2024.
  • (2024) Multimodal heterogeneous graph entity-level fusion for named entity recognition with multi-granularity visual guidance. The Journal of Supercomputing. DOI: 10.1007/s11227-024-06347-8. Online publication date: 22-Jul-2024.
  • (2024) MVPN: Multi-granularity visual prompt-guided fusion network for multimodal named entity recognition. Multimedia Tools and Applications, 83(28), 71639-71663. DOI: 10.1007/s11042-024-18472-w. Online publication date: 8-Feb-2024.
