DOI: 10.1145/3603765.3603771
research-article

Applying BioBERT to Extract Germline Gene-Disease Associations for Building a Knowledge Graph from the Biomedical Literature

Published: 17 October 2023

Abstract

The volume of published biomedical information has grown rapidly and continues to grow. Recent advances in Natural Language Processing (NLP) have generated considerable interest in automating the extraction, normalization, and representation of biomedical knowledge about entities such as genes and diseases. Our study analyzes abstracts from the germline literature to construct a knowledge graph of the extensive work done in this area on genes and diseases. This paper presents SimpleGermKG, an automatic knowledge graph construction approach that connects germline genes and diseases. To extract genes and diseases, we employ BioBERT, a BERT model pre-trained on biomedical corpora. We propose an ontology-based and rule-based algorithm to standardize and disambiguate medical terms. To capture semantic relationships between articles, genes, and diseases, we implemented a part-whole relation approach that connects each entity with its data source, and we visualize the result in a graph-based knowledge representation. Lastly, we discuss the knowledge graph's applications, limitations, and challenges to inspire future research on germline corpora. Our knowledge graph contains 297 genes, 130 diseases, and 46,747 triples. Graph-based visualizations are used to show the results.
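The abstract names two steps that can be illustrated in a few lines: rule-based normalization of extracted mentions, and part-whole triples that tie each entity to its source article. The following is a minimal sketch under stated assumptions, not the SimpleGermKG implementation. The synonym tables stand in for an ontology lookup, and the relation names `has_gene` and `has_disease`, the helper names, and the article identifiers are all hypothetical. The entity mentions are given as plain lists, as if a named entity recognizer had already produced them.

```python
import re
from collections import defaultdict

# Hypothetical synonym tables standing in for an ontology lookup (assumption).
GENE_SYNONYMS = {"brca1": "BRCA1", "tp53": "TP53", "p53": "TP53"}
DISEASE_SYNONYMS = {
    "breast cancer": "breast carcinoma",
    "breast carcinoma": "breast carcinoma",
    "ovarian cancer": "ovarian carcinoma",
}


def normalize(mention, table):
    """Rule-based cleanup (lowercase, collapse spaces, trim punctuation), then lookup."""
    key = re.sub(r"\s+", " ", mention.strip().lower()).strip(".,;:()")
    return table.get(key)  # None when the mention is not in the table


def build_triples(articles):
    """Emit (article, relation, entity) triples linking each entity to its source."""
    triples = set()
    for article_id, gene_mentions, disease_mentions in articles:
        for mention in gene_mentions:
            gene = normalize(mention, GENE_SYNONYMS)
            if gene is not None:
                triples.add((article_id, "has_gene", gene))
        for mention in disease_mentions:
            disease = normalize(mention, DISEASE_SYNONYMS)
            if disease is not None:
                triples.add((article_id, "has_disease", disease))
    return triples


def co_mentions(triples):
    """Pair genes and diseases that share a source article."""
    genes, diseases = defaultdict(set), defaultdict(set)
    for subject, relation, obj in triples:
        if relation == "has_gene":
            genes[subject].add(obj)
        elif relation == "has_disease":
            diseases[subject].add(obj)
    return {
        (gene, disease, article)
        for article in genes
        for gene in genes[article]
        for disease in diseases.get(article, ())
    }


# Hypothetical input: identifiers and mentions are made up for illustration.
example_articles = [
    ("article-1", ["BRCA1", "p53 "], ["Breast  Cancer."]),
    ("article-2", ["TP53", "unknown gene"], ["ovarian cancer"]),
]
example_triples = build_triples(example_articles)
example_pairs = co_mentions(example_triples)
```

In this sketch a gene and a disease are only linked through the article that mentions both, which mirrors the idea of connecting each entity with its data source. Mentions missing from the tables are dropped rather than guessed.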



      Published In

      ICISDM '23: Proceedings of the 2023 7th International Conference on Information System and Data Mining
      May 2023
      109 pages
      ISBN:9798400700637
      DOI:10.1145/3603765
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tags

      1. BioBERT
      2. entity recognition
      3. germline mutations
      4. knowledge graph
      5. semantic relation

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Conference

      ICISDM 2023
