DOI: 10.1145/3581783.3612322
research-article

Fine-Grained Multimodal Named Entity Recognition and Grounding with a Generative Framework

Published: 27 October 2023

    Abstract

    Multimodal Named Entity Recognition (MNER) aims to locate and classify named entities mentioned in an image-text pair. However, most previous MNER works focus on extracting entities from the text while failing to ground these textual mentions to their corresponding visual objects. Moreover, existing MNER studies primarily classify entities into four coarse-grained entity types, which are often insufficient for mapping entities to their real-world referents. To address these limitations, we introduce a task named Fine-grained Multimodal Named Entity Recognition and Grounding (FMNERG), which aims to simultaneously extract named entities from the text, their fine-grained entity types, and their grounded visual objects in the image. We further construct a Twitter dataset for the FMNERG task and propose a T5-based multImodal GEneration fRamework (TIGER), which formulates FMNERG as a generation problem: all entity-type-object triples are converted into a target sequence, and a pre-trained sequence-to-sequence T5 model is adapted to directly generate the target sequence from an image-text input pair. Experimental results demonstrate that TIGER performs significantly better than a number of baseline systems on the annotated Twitter dataset. Our dataset annotation and source code are publicly released at https://github.com/NUSTM/FMNERG.
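
    As a rough illustration of the generation formulation described above, the conversion of entity-type-object triples into a single target sequence for a seq2seq model might be sketched as follows. The delimiter token, phrasing template, type names, and region identifiers below are illustrative assumptions, not the exact template used by TIGER:

```python
def linearize_triples(triples):
    """Convert (entity, fine_grained_type, visual_object) triples into one
    target string for a sequence-to-sequence model such as T5.

    NOTE: the "[SSEP]" separator and the "is a ... and grounded to ..."
    template are illustrative assumptions, not TIGER's actual format.
    """
    parts = []
    for entity, etype, obj in triples:
        # An entity with no groundable visual object is marked "none".
        obj = obj if obj is not None else "none"
        parts.append(f"{entity} is a {etype} and grounded to {obj}")
    return " [SSEP] ".join(parts)


# Hypothetical example: one groundable and one ungroundable entity.
target = linearize_triples([
    ("LeBron James", "basketball player", "region_3"),
    ("Los Angeles", "city", None),
])
```

    The inverse mapping (parsing the generated string back into triples) would then recover the entities, their fine-grained types, and their grounded regions at inference time.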

    Supplementary Material

    MP4 File (mmfp2752-video.mp4)
    Presentation video - short version. This presentation introduces a novel task in multimodal fine-grained named entity recognition, called FMNERG (Fine-grained Multimodal Named Entity Recognition and Grounding). FMNERG aims to simultaneously extract entities, their fine-grained categories, and the corresponding visual regions in a given image-text pair. The task contributes to the automated construction of large-scale multimodal knowledge graphs. The work formulates FMNERG as a generation problem and proposes a T5-based generative framework named TIGER. Experimental results demonstrate that TIGER significantly outperforms existing sequence-labeling-based multimodal named entity recognition methods on the FMNERG task.


    Cited By

    View all
    • (2024) UniQRNet: Unifying Referring Expression Grounding and Segmentation with QRNet. ACM Transactions on Multimedia Computing, Communications, and Applications 20(8), 1-28. DOI: 10.1145/3660638. Online publication date: 13 June 2024.

    Published In

    MM '23: Proceedings of the 31st ACM International Conference on Multimedia
    October 2023
    9913 pages
    ISBN:9798400701085
    DOI:10.1145/3581783
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. fine-grained named entity recognition
    2. generative framework
    3. multimodal named entity recognition
    4. visual grounding

    Qualifiers

    • Research-article

    Conference

    MM '23: The 31st ACM International Conference on Multimedia
    October 29 - November 3, 2023
    Ottawa, ON, Canada

    Acceptance Rates

    Overall Acceptance Rate 995 of 4,171 submissions, 24%

