Joint learning for biomedical NER and entity normalization: encoding schemes, counterfactual examples, and zero-shot evaluation

Authors:

Ramakanth Kavuluru

BCB '21: Proceedings of the 12th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics

Article No.: 55, Pages 1 - 10

https://doi.org/10.1145/3459930.3469533

Published: 01 August 2021

Abstract

Named entity recognition (NER) and entity normalization (EN) form an indispensable first step in many biomedical natural language processing applications. In biomedical information science, recognizing entities (e.g., genes, diseases, or drugs) and normalizing them to concepts in standard terminologies or thesauri (e.g., Entrez, ICD-10, or RxNorm) is crucial for identifying more informative relations among them that drive disease etiology, progression, and treatment. In this effort, we pursue two high-level strategies to improve biomedical NER and EN. The first is to decouple standard entity encoding tags (e.g., "B-Drug" for the beginning of a drug mention) into type tags (e.g., "Drug") and positional tags (e.g., "B"). The second is to use additional counterfactual training examples to keep models from learning spurious correlations between surrounding context and normalized concepts in the training data. We conduct extensive experiments using the MedMentions dataset, the largest dataset of its kind for NER and EN in biomedicine. We find that the first strategy performs better in entity normalization than the standard encoding scheme. The second strategy, a form of data augmentation, uniformly improves performance in span detection, typing, and normalization. The gains from counterfactual examples are more prominent in zero-shot settings, that is, for concepts never encountered during training.
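
To make the first strategy concrete, the following is a minimal sketch of splitting combined tags such as "B-Drug" into a positional sequence and a type sequence. The abstract does not specify the tagging scheme or how tokens outside any entity are handled, so the BIO scheme, the "O" placeholder type, and the function names `decouple_tags` and `recouple_tags` are assumptions made for illustration only, not the authors' implementation.

```python
from typing import List, Tuple


def decouple_tags(tags: List[str]) -> Tuple[List[str], List[str]]:
    """Split combined tags (e.g. 'B-Drug') into positional and type tags."""
    positions: List[str] = []
    types: List[str] = []
    for tag in tags:
        if tag == "O":
            # Assumption: tokens outside any entity get "O" in both sequences.
            positions.append("O")
            types.append("O")
        else:
            # Split at the first hyphen only, so type names containing
            # hyphens stay intact.
            position, _, entity_type = tag.partition("-")
            positions.append(position)
            types.append(entity_type)
    return positions, types


def recouple_tags(positions: List[str], types: List[str]) -> List[str]:
    """Rebuild combined tags from the two decoupled sequences."""
    return [p if p == "O" else f"{p}-{t}" for p, t in zip(positions, types)]


# Hypothetical example sentence and labels, not taken from MedMentions.
tokens = ["Aspirin", "reduces", "heart", "attack", "risk"]
combined = ["B-Drug", "O", "B-Disease", "I-Disease", "O"]

positions, types = decouple_tags(combined)
for token, position, entity_type in zip(tokens, positions, types):
    print(f"{token:10s} {position:2s} {entity_type}")
```

With the two sequences separated, a model could predict span boundaries and entity types with separate output layers; whether the paper does exactly this is not stated in the abstract.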

Index Terms

Joint learning for biomedical NER and entity normalization: encoding schemes, counterfactual examples, and zero-shot evaluation
1. Applied computing
  1. Life and medical sciences
2. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Information extraction
  2. Machine learning

Index terms have been assigned to the content through auto-classification.

Published In

BCB '21: Proceedings of the 12th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics

August 2021

603 pages

ISBN: 9781450384506

DOI: 10.1145/3459930

General Chairs: Hongmei Jiang (Northwestern University), Xiuzhen Huang (Arkansas State University), Jiajie Zhang (The University of Texas Health Science Center at Houston)

Copyright © 2021 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGBIOM: ACM Special Interest Group on Biomedical Computing

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

U.S. National Library of Medicine

Conference

BCB '21

Sponsor:

SIGBIOM

BCB '21: 12th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics

August 1 - 4, 2021

Gainesville, Florida

Acceptance Rates

Overall Acceptance Rate 254 of 885 submissions, 29%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 6
Total Downloads: 279
Downloads (Last 12 months): 107
Downloads (Last 6 weeks): 5

Reflects downloads up to 11 Jan 2025

Citations

Cited By

Báez P, Campillos-Llanos L, Núñez F, Dunstan J (2024). Entity normalization in a Spanish medical corpus using a UMLS-based lexicon: findings and limitations. Language Resources and Evaluation. Online publication date: 2-Jul-2024.
https://doi.org/10.1007/s10579-024-09755-7
An B (2023). Construction and application of Chinese breast cancer knowledge graph based on multi-source heterogeneous data. Mathematical Biosciences and Engineering 20:4 (6776-6799). Online publication date: 2023.
https://doi.org/10.3934/mbe.2023292
Sezgin E, Hussain S, Rust S, Huang Y (2023). Extracting Medical Information From Free-Text and Unstructured Patient-Generated Health Data Using Natural Language Processing Methods: Feasibility Study With Real-world Data. JMIR Formative Research 7 (e43014). Online publication date: 7-Mar-2023.
https://doi.org/10.2196/43014
Diaz Gonzalez A, Hughes K, Yue S, Hayes S (2023). Applying BioBERT to Extract Germline Gene-Disease Associations for Building a Knowledge Graph from the Biomedical Literature. Proceedings of the 2023 7th International Conference on Information System and Data Mining (37-42). Online publication date: 10-May-2023.
https://dl.acm.org/doi/10.1145/3603765.3603771
Gou Y, Jie C (2023). A lightweight biomedical named entity recognition with pre-trained model. 2023 IEEE 3rd International Conference on Data Science and Computer Application (ICDSCA) (117-121). Online publication date: 27-Oct-2023.
https://doi.org/10.1109/ICDSCA59871.2023.10392374
Zan D, Wang S, Zhang H, Yan Y, Wu W, Guan B, Wang Y (2022). SQL: Retrieval Augmented Zero-Shot Question Answering over Knowledge Graph. Advances in Knowledge Discovery and Data Mining (223-236). Online publication date: 16-May-2022.
https://dl.acm.org/doi/10.1007/978-3-031-05981-0_18
