MAF: a general matching and alignment framework for multimodal named entity recognition

B Xu, S Huang, C Sha, H Wang - … conference on web search and data …, 2022 - dl.acm.org

Proceedings of the fifteenth ACM international conference on web search and …, 2022•dl.acm.org

In this paper, we study multimodal named entity recognition (MNER) in social media posts. Existing works mainly use a cross-modal attention mechanism to combine text representations with image representations. However, they still suffer from two weaknesses: (1) current methods rest on the strong assumption that each text and its accompanying image are matched, so that the image can help identify named entities in the text. This assumption does not always hold in real scenarios, and relying on it may degrade the recognition performance of the MNER model; (2) current methods fail to construct a consistent representation that bridges the semantic gap between the two modalities, which prevents the model from establishing a good connection between text and image. To address these issues, we propose a general matching and alignment framework (MAF) for multimodal named entity recognition in social media posts. Specifically, to solve the first issue, we propose a novel cross-modal matching (CM) module that calculates the similarity score between text and image and uses the score to determine the proportion of visual information that should be retained. To solve the second issue, we propose a novel cross-modal alignment (CA) module that makes the representations of the two modalities more consistent. We conduct extensive experiments, ablation studies, and case studies to demonstrate the effectiveness and efficiency of our method. The source code of this paper is available at https://github.com/xubodhu/MAF.
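
The matching idea in the abstract, a text-image similarity score that controls how much visual information is retained, can be illustrated with a small sketch. This is not the authors' implementation (see the repository linked above for that). It assumes, purely for illustration, that the text and image are already encoded as equal-length vectors, that plain cosine similarity stands in for the learned matching score, and that fusion is a simple gated sum. The function names `cosine_similarity`, `matching_gate`, and `fuse` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def matching_gate(text_vec, image_vec):
    """Rescale cosine similarity from [-1, 1] to a gate in [0, 1].

    Assumption: a fixed cosine score replaces the learned matching module.
    """
    return 0.5 * (cosine_similarity(text_vec, image_vec) + 1.0)


def fuse(text_vec, image_vec):
    """Add the image vector to the text vector, scaled by the matching gate.

    A mismatched image (gate near 0) contributes almost nothing;
    a well-matched image (gate near 1) is kept nearly in full.
    """
    gate = matching_gate(text_vec, image_vec)
    return [t + gate * v for t, v in zip(text_vec, image_vec)]
```

Under these assumptions, an image vector pointing in the opposite direction to the text vector is suppressed entirely, while an orthogonal one is retained at half weight.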
