Applying BioBERT to Extract Germline Gene-Disease Associations for Building a Knowledge Graph from the Biomedical Literature

Authors: Armando D. Diaz Gonzalez, Kevin S. Hughes, Sean T. Hayes

ICISDM '23: Proceedings of the 2023 7th International Conference on Information System and Data Mining

Pages 37 - 42

https://doi.org/10.1145/3603765.3603771

Published: 17 October 2023

Abstract

The volume of published biomedical information has grown rapidly and continues to grow. Recent advances in Natural Language Processing (NLP) have generated considerable interest in automating the extraction, normalization, and representation of biomedical knowledge about entities such as genes and diseases. Our study analyzes abstracts from the germline literature to construct a knowledge graph that captures the large body of existing work on genes and diseases in this area. This paper presents SimpleGermKG, an automatic knowledge graph construction approach that connects germline genes and diseases. To extract genes and diseases, we employ BioBERT, a BERT model pre-trained on biomedical corpora. We propose an ontology-based and rule-based algorithm to standardize and disambiguate medical terms. To represent semantic relationships between articles, genes, and diseases, we implement a part-whole relation approach that connects each entity with its data source, and we visualize the result in a graph-based knowledge representation. Lastly, we discuss the knowledge graph's applications, limitations, and challenges to inspire future research on germline corpora. Our knowledge graph contains 297 genes, 130 diseases, and 46,747 triples. Graph-based visualizations are used to show the results.
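The abstract does not spell out the normalization rules or the triple schema, so the following is only a minimal sketch of how the normalization and part-whole linking steps might fit together. It assumes entity mentions have already been extracted (for example by a BioBERT-based tagger). The synonym tables, the `has_part` predicate, the `gene:`/`disease:` object prefixes, and the placeholder article identifiers are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict

# Hypothetical synonym tables standing in for an ontology lookup.
GENE_SYNONYMS = {"brca1": "BRCA1", "tp53": "TP53", "p53": "TP53"}
DISEASE_SYNONYMS = {
    "breast cancer": "breast carcinoma",
    "breast carcinoma": "breast carcinoma",
    "li-fraumeni syndrome": "Li-Fraumeni syndrome",
}

# Placeholder article IDs and mentions, invented for illustration only.
EXAMPLE_ARTICLES = {
    "PMID:0000001": [("gene", "BRCA1"), ("disease", "Breast  Cancer")],
    "PMID:0000002": [
        ("gene", "p53"),
        ("disease", "Li-Fraumeni syndrome"),
        ("gene", "unknown-gene"),
    ],
}


def normalize(mention, table):
    """Lowercase, collapse whitespace, then look the mention up in the table.

    Returns the canonical name, or None when the mention is not recognized.
    """
    key = " ".join(mention.lower().split())
    return table.get(key)


def build_triples(articles):
    """Link each article (the whole) to its normalized entities (the parts)."""
    triples = set()
    for article_id, mentions in articles.items():
        for kind, text in mentions:
            table = GENE_SYNONYMS if kind == "gene" else DISEASE_SYNONYMS
            canonical = normalize(text, table)
            if canonical is None:
                continue  # drop mentions that cannot be normalized
            triples.add((article_id, "has_part", f"{kind}:{canonical}"))
    return triples


def gene_disease_pairs(triples):
    """Map each (gene, disease) pair to the set of articles containing both."""
    by_article = defaultdict(lambda: {"gene": set(), "disease": set()})
    for article_id, _predicate, obj in triples:
        kind, name = obj.split(":", 1)
        by_article[article_id][kind].add(name)
    pairs = defaultdict(set)
    for article_id, parts in by_article.items():
        for gene in parts["gene"]:
            for disease in parts["disease"]:
                pairs[(gene, disease)].add(article_id)
    return dict(pairs)


if __name__ == "__main__":
    for triple in sorted(build_triples(EXAMPLE_ARTICLES)):
        print(triple)
```

Storing the source article in every triple keeps each gene or disease traceable to the literature it came from. In this sketch a gene-disease association is simply co-occurrence within one article, which is a deliberate simplification.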

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Information extraction
2. Information systems
  1. Data management systems
    1. Database design and models
      1. Graph-based database models

Published In

ICISDM '23: Proceedings of the 2023 7th International Conference on Information System and Data Mining

May 2023

109 pages

ISBN:9798400700637

DOI:10.1145/3603765

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

ICISDM 2023

ICISDM 2023: 2023 the 7th International Conference on Information System and Data Mining

May 10 - 12, 2023

Atlanta, USA
