DOI: 10.1145/3459930.3469533
research-article

Joint learning for biomedical NER and entity normalization: encoding schemes, counterfactual examples, and zero-shot evaluation

Published: 01 August 2021

Abstract

Named entity recognition (NER) and entity normalization (EN) form an indispensable first step in many biomedical natural language processing applications. In biomedical information science, recognizing entities (e.g., genes, diseases, or drugs) and normalizing them to concepts in standard terminologies or thesauri (e.g., Entrez, ICD-10, or RxNorm) is crucial for identifying more informative relations among them that drive disease etiology, progression, and treatment. In this work, we pursue two high-level strategies to improve biomedical NER and EN. The first is to decouple standard entity encoding tags (e.g., "B-Drug" for the beginning of a drug) into type tags (e.g., "Drug") and positional tags (e.g., "B"). The second is to use additional counterfactual training examples to keep models from learning spurious correlations between surrounding context and normalized concepts in the training data. We conduct extensive experiments using the MedMentions dataset, the largest dataset of its kind for NER and EN in biomedicine. We find that the first strategy performs better in entity normalization than the standard encoding scheme. The second strategy, a form of data augmentation, uniformly improves performance in span detection, typing, and normalization. The gains from counterfactual examples are more prominent in zero-shot settings, that is, for concepts never encountered during training.
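The tag decoupling and the zero-shot subset described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the use of "O" as the type of non-entity tokens, the (mention, concept ID) tuple layout, and the placeholder concept IDs are all assumptions made for illustration.

```python
from typing import List, Set, Tuple


def decouple_tags(tags: List[str]) -> Tuple[List[str], List[str]]:
    """Split joint tags such as "B-Drug" into positional tags and type tags.

    Assumption: tokens outside any entity carry "O" in both sequences.
    """
    positions, types = [], []
    for tag in tags:
        if tag == "O":
            positions.append("O")
            types.append("O")
        else:
            # Split on the first hyphen only, so a type name may contain hyphens.
            position, _, entity_type = tag.partition("-")
            positions.append(position)
            types.append(entity_type)
    return positions, types


def recouple_tags(positions: List[str], types: List[str]) -> List[str]:
    """Inverse of decouple_tags: rebuild joint tags from the two sequences."""
    return [p if p == "O" else f"{p}-{t}" for p, t in zip(positions, types)]


def zero_shot_subset(
    test_mentions: List[Tuple[str, str]], train_concepts: Set[str]
) -> List[Tuple[str, str]]:
    """Keep test mentions whose concept ID never appears in the training data.

    Each mention is a hypothetical (surface text, concept ID) pair.
    """
    return [m for m in test_mentions if m[1] not in train_concepts]


if __name__ == "__main__":
    joint = ["O", "B-Drug", "I-Drug", "O", "B-Disease"]
    positions, types = decouple_tags(joint)
    print(positions)
    print(types)

    # Placeholder concept IDs, not real terminology identifiers.
    test = [("aspirin", "CUI_A"), ("fever", "CUI_B")]
    print(zero_shot_subset(test, train_concepts={"CUI_A"}))
```

With the two sequences separated, a model can predict position and type with separate output layers, and `recouple_tags` recovers the standard scheme for scoring against joint-tag baselines.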

Published In

BCB '21: Proceedings of the 12th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics
August 2021
603 pages
ISBN: 9781450384506
DOI: 10.1145/3459930

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

1. biomedical natural language processing
2. deep neural networks
3. entity normalization
4. information extraction
5. named entity recognition

Acceptance Rates

Overall acceptance rate: 254 of 885 submissions (29%)
