research-article

Validating Ontology-based Annotations of Biomedical Resources using Zero-shot Learning

Author:

Dimitrios Koutsomitropoulos

CSBio2021: The 12th International Conference on Computational Systems-Biology and Bioinformatics

October 2021

Pages 37 - 43

https://doi.org/10.1145/3486713.3486730

Published: 14 December 2021

Abstract

Authoritative thesauri in the form of web ontologies offer a sound representation of domain knowledge and can act as a reference point for automated semantic tagging. At the same time, current language models capture the contextualized semantics of text corpora and can be leveraged towards this goal. We present an approach for injecting subject annotations using query term expansion against such ontologies in the biomedical domain. To give the user an indication of the usefulness of these suggestions, we further propose an online method for validating the quality of annotations using NLI models such as BART and XLM-R. To circumvent the training barriers posed by very large label sets and scarcity of data, we rely on zero-shot classification and show that semantic matching can contribute above-average thematic annotations. A web-based validation service can also be attractive to human curators compared with the overhead of pretraining large, domain-tailored classification models.
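The two steps named in the abstract, suggesting labels by term expansion and then validating them as a zero-shot NLI problem, can be illustrated with a minimal sketch. This is not the paper's implementation: the toy thesaurus, the function names, the hypothesis template, the threshold, and the stub scorer are all assumptions made for illustration. In a real setting the scorer would be replaced by the entailment probability returned by an NLI model.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy thesaurus: preferred label -> expansion terms.
# A real system would query an ontology (for example MeSH) instead.
TOY_THESAURUS: Dict[str, List[str]] = {
    "Neoplasms": ["cancer", "tumor", "tumour"],
    "Diabetes Mellitus": ["diabetes", "insulin resistance"],
    "Hypertension": ["high blood pressure"],
}


def suggest_labels(text: str, thesaurus: Dict[str, List[str]]) -> List[str]:
    """Suggest preferred labels whose own or expanded terms occur in the text."""
    lowered = text.lower()
    suggestions = []
    for preferred, expansions in thesaurus.items():
        terms = [preferred] + expansions
        if any(term.lower() in lowered for term in terms):
            suggestions.append(preferred)
    return suggestions


def stub_scorer(premise: str, hypothesis: str) -> float:
    """Placeholder for an NLI entailment probability (NOT a real model).

    Assumption for illustration only: it pretends the model is confident
    when the hypothesis mentions hypertension and unsure otherwise.
    """
    return 0.9 if "hypertension" in hypothesis.lower() else 0.2


def validate_annotations(
    text: str,
    labels: List[str],
    scorer: Callable[[str, str], float],
    template: str = "This text is about {}.",
    threshold: float = 0.5,
) -> List[Tuple[str, float, bool]]:
    """Score each label as an entailment hypothesis against the text.

    The text is the premise; each label is wrapped in the template to form
    a hypothesis. Returns (label, score, accepted) sorted by score, highest first.
    """
    results = []
    for label in labels:
        score = scorer(text, template.format(label))
        results.append((label, score, score >= threshold))
    return sorted(results, key=lambda item: item[1], reverse=True)
```

One way to swap in a real model, assuming the Hugging Face `transformers` library is available, would be its `zero-shot-classification` pipeline with an NLI checkpoint, passing the suggested labels as `candidate_labels`. Keeping the scorer as a plain callable means the validation logic does not depend on which model or hosting service sits behind it.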

Published In

CSBio2021: The 12th International Conference on Computational Systems-Biology and Bioinformatics

October 2021

97 pages

ISBN: 9781450385107

DOI: 10.1145/3486713

Editors:
Kitsuchart Pasupa (King Mongkut's Institute of Technology Ladkrabang)
Chee Keong Kwoh (Nanyang Technological University)
Sansanee Auephanwiriyakul (Chiang Mai University)
Nipon Theera-umpon (Chiang Mai University)

Copyright © 2021 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

CSBio2021

October 14 - 15, 2021

Virtual (GMT+7 Bangkok Time), Thailand

Acceptance Rates

Overall Acceptance Rate 23 of 37 submissions, 62%
