Article

A Framework for False Negative Detection in NER/NEL

Authors:

Álvaro Abella-Bascarán,

Paula Chocrón,

Gabriel de Maeztu

Natural Language Processing and Information Systems: 27th International Conference on Applications of Natural Language to Information Systems, NLDB 2022, Valencia, Spain, June 15–17, 2022, Proceedings

Pages 323 - 330

https://doi.org/10.1007/978-3-031-08473-7_30

Published: 15 June 2022

Abstract

Finding the false negatives of a NER/NEL system is fundamental to improving it, and is usually done by manually annotating texts. However, in an environment with a huge volume of unannotated texts (e.g., a hospital) and a low frequency of positives (e.g., mentions of a particular disease in clinical notes), the task becomes very inefficient. This paper presents a framework to tackle this problem: given an existing NER/NEL system, we propose using text similarity search to rank texts by their probability of containing false negatives of a given concept, using as queries those texts in which the existing system has found positives of that concept. We formulate text similarity as a function of the medical entities shared between texts, and we repurpose an existing public dataset (CodiEsp) to propose an evaluation strategy.
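
The ranking idea described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the similarity function, so the Jaccard overlap used here is an assumption chosen for illustration, and the names `jaccard`, `rank_candidates`, and the sample notes are hypothetical rather than taken from the paper.

```python
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap between two entity sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_candidates(
    entities: Dict[str, Set[str]], concept: str
) -> List[Tuple[str, float]]:
    """Rank texts lacking `concept` by similarity to texts that contain it."""
    # Query texts: those where the existing system found the concept.
    queries = [ents for ents in entities.values() if concept in ents]
    scored = []
    for text_id, ents in entities.items():
        if concept in ents:
            continue  # already a positive, not a false-negative candidate
        # Compare on the other entities; keep the best match to any query.
        score = max((jaccard(ents, q - {concept}) for q in queries), default=0.0)
        scored.append((text_id, score))
    # Highest score first; ties broken by text id for a stable order.
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Hypothetical entity sets, as an existing NER/NEL system might output them.
docs = {
    "note_1": {"diabetes", "metformin", "hypertension"},
    "note_2": {"metformin", "hypertension"},
    "note_3": {"fracture", "ibuprofen"},
}
ranking = rank_candidates(docs, "diabetes")
```

In this toy input, `note_2` shares all of its entities with the positive `note_1` and so would be reviewed first as a likely false negative, while `note_3` shares none.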

Published In

Natural Language Processing and Information Systems: 27th International Conference on Applications of Natural Language to Information Systems, NLDB 2022, Valencia, Spain, June 15–17, 2022, Proceedings

Jun 2022

529 pages

ISBN: 978-3-031-08472-0

DOI: 10.1007/978-3-031-08473-7

Editors:

Paolo Rosso (Universitat Politècnica de València, Valencia, Spain),

Valerio Basile (University of Turin, Torino, Italy),

Raquel Martínez (Universidad Nacional de Educación a Distancia, Madrid, Spain),

Elisabeth Métais (Conservatoire National des Arts et Métiers, Paris, France),

Farid Meziane (University of Derby, Derby, UK)

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022.

Publisher

Springer-Verlag

Berlin, Heidelberg
