research-article

A Biomedical Entity Extraction Pipeline for Oncology Health Records in Portuguese

Authors:

Alipio Mario Jorge,

Arian Pasquali,

Catarina Santos,

Mario Lopes

SAC '23: Proceedings of the 38th ACM/SIGAPP Symposium on Applied Computing

Pages 950 - 956

https://doi.org/10.1145/3555776.3578577

Published: 07 June 2023

Abstract

Textual health records of cancer patients are usually long and highly unstructured, making it very time-consuming for health professionals to get a complete overview of a patient's therapeutic course. Because these limitations can lead to suboptimal or inefficient treatment, healthcare providers would benefit greatly from a system that effectively summarizes the information in those records. With the advent of deep neural models, this objective has been partially attained for English clinical texts; however, the research community still lacks an effective solution for languages with limited resources. In this paper, we present the approach we developed to extract procedures, drugs, and diseases from oncology health records written in European Portuguese. This project was conducted in collaboration with the Portuguese Institute for Oncology, which, besides holding over 10 years of duly protected medical records, also provided oncologist expertise throughout the development of the project. Because no annotated corpus exists for biomedical entity extraction in Portuguese, we also present the strategy we followed to annotate the corpus used to develop the models. The final models, which combine a neural architecture with entity linking, achieved F₁ scores of 88.6%, 95.0%, and 55.8% in the mention extraction of procedures, drugs, and diseases, respectively.
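As an illustration of how a per-type F₁ score for mention extraction can be computed, the following is a minimal sketch and not the authors' evaluation code. It assumes BIO-tagged token sequences and exact-match scoring, where a predicted mention counts only if its boundaries and type both match the gold annotation. The abstract does not state the matching criterion, and the label names `DRUG`, `PROC`, and `DIS` are hypothetical.

```python
from typing import List, Set, Tuple

# (start token index, end token index exclusive, entity type)
Span = Tuple[int, int, str]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Convert a BIO tag sequence into a set of typed mention spans."""
    spans: Set[Span] = set()
    start, label = None, None
    # A trailing "O" sentinel closes a mention that runs to the end.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


def mention_f1(gold: Set[Span], pred: Set[Span], entity_type: str) -> float:
    """Exact-match F1 for one entity type (assumed scoring criterion)."""
    gold_t = {s for s in gold if s[2] == entity_type}
    pred_t = {s for s in pred if s[2] == entity_type}
    true_positives = len(gold_t & pred_t)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_t)
    recall = true_positives / len(gold_t)
    return 2 * precision * recall / (precision + recall)


# Hypothetical six-token sentence: the predicted procedure mention is too short.
gold_tags = ["B-DRUG", "O", "B-PROC", "I-PROC", "O", "B-DIS"]
pred_tags = ["B-DRUG", "O", "B-PROC", "O", "O", "B-DIS"]

gold_spans = bio_to_spans(gold_tags)
pred_spans = bio_to_spans(pred_tags)

for entity_type in ("PROC", "DRUG", "DIS"):
    print(entity_type, mention_f1(gold_spans, pred_spans, entity_type))
```

Under exact matching, the truncated procedure mention in this toy example scores as both a false positive and a false negative, which is one reason multi-token entity types can score lower than short ones such as drug names.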

Index Terms

A Biomedical Entity Extraction Pipeline for Oncology Health Records in Portuguese
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Information extraction
2. Information systems
  1. Information retrieval
    1. Document representation
      1. Document structure
  2. Information systems applications
    1. Data mining
    2. Decision support systems

Published In

SAC '23: Proceedings of the 38th ACM/SIGAPP Symposium on Applied Computing

March 2023

1932 pages

ISBN:9781450395175

DOI:10.1145/3555776

Conference Chairs: Jiman Hong (Soongsil University, South Korea), Maart Lanperne (Tallinn University, Estonia)

Program Chairs: Juw Won Park (University of Louisville, USA), Tomas Cerny (Baylor University, USA)

Publication Chair: Hossain Shahriar (Kennesaw State University, USA)

Copyright © 2023 ACM.

Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Sponsors

SIGAPP: ACM Special Interest Group on Applied Computing

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

Fundação para a Ciência e a Tecnologia

Conference

SAC '23

Sponsor:

SIGAPP

SAC '23: 38th ACM/SIGAPP Symposium on Applied Computing

March 27 - 31, 2023

Tallinn, Estonia

Acceptance Rates

Overall Acceptance Rate 1,650 of 6,669 submissions, 25%
