On the Strength of Character Language Models for Multilingual Named Entity Recognition

Yu, Xiaodong; Mayhew, Stephen; Sammons, Mark; Roth, Dan

arXiv:1809.05157 (cs)

[Submitted on 13 Sep 2018 (v1), last revised 20 Sep 2018 (this version, v2)]

Abstract: Character-level patterns have been widely used as features in English Named Entity Recognition (NER) systems. However, to date there has been no direct investigation of the inherent differences between name and non-name tokens in text, nor whether this property holds across multiple languages. This paper analyzes the capabilities of corpus-agnostic Character-level Language Models (CLMs) in the binary task of distinguishing name tokens from non-name tokens. We demonstrate that CLMs provide a simple and powerful model for capturing these differences, identifying named entity tokens in a diverse set of languages at close to the performance of full NER systems. Moreover, by adding very simple CLM-based features we can significantly improve the performance of an off-the-shelf NER system for multiple languages.
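
To make the binary name/non-name task concrete, here is a minimal sketch of one way a character-level language model could be used for it. It is not taken from the paper: it assumes two add-one-smoothed character trigram models, one fit on name tokens and one on non-name tokens, and labels a token by whichever model gives it lower perplexity. The class `CharNgramLM`, the function `is_name`, the trigram order, the smoothing scheme, and the toy word lists are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


class CharNgramLM:
    """Add-one smoothed character n-gram model (illustrative assumption)."""

    def __init__(self, order=3):
        self.order = order
        self.ngrams = Counter()    # counts of (context + next character)
        self.contexts = Counter()  # counts of each context
        self.vocab = set()

    def _pad(self, token):
        # "^" marks the start of a token, "$" marks its end.
        return "^" * (self.order - 1) + token + "$"

    def fit(self, tokens):
        for tok in tokens:
            padded = self._pad(tok)
            self.vocab.update(padded)
            for i in range(self.order - 1, len(padded)):
                ctx = padded[i - self.order + 1:i]
                self.ngrams[ctx + padded[i]] += 1
                self.contexts[ctx] += 1
        return self

    def perplexity(self, token):
        padded = self._pad(token)
        v = len(self.vocab) + 1  # +1 reserves mass for unseen characters
        log_prob, n = 0.0, 0
        for i in range(self.order - 1, len(padded)):
            ctx = padded[i - self.order + 1:i]
            p = (self.ngrams[ctx + padded[i]] + 1) / (self.contexts[ctx] + v)
            log_prob += math.log(p)
            n += 1
        return math.exp(-log_prob / n)


def is_name(token, name_lm, other_lm):
    """Label a token as a name if the name model finds it less surprising."""
    return name_lm.perplexity(token) < other_lm.perplexity(token)


# Toy training data (hypothetical; a real setup would use annotated corpora).
name_lm = CharNgramLM().fit(["London", "Berlin", "Madrid", "Dublin"])
other_lm = CharNgramLM().fit(["the", "walked", "quickly", "and", "of"])

for tok in ["London", "the"]:
    print(tok, is_name(tok, name_lm, other_lm))
```

With lists this small the models only memorize their training tokens, so the sketch shows the mechanics of the comparison rather than any result reported in the abstract.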

Comments:	5 pages, EMNLP 2018 short paper
Subjects:	Computation and Language (cs.CL); Information Retrieval (cs.IR)
Cite as:	arXiv:1809.05157 [cs.CL]
	(or arXiv:1809.05157v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.1809.05157
Journal reference:	EMNLP 2018

Submission history

From: Xiaodong Yu
[v1] Thu, 13 Sep 2018 20:01:20 UTC (700 KB)
[v2] Thu, 20 Sep 2018 17:10:03 UTC (709 KB)

