DOI: 10.1145/3486713.3486730
Research article

Validating Ontology-based Annotations of Biomedical Resources using Zero-shot Learning

Published: 14 December 2021

Abstract

    Authoritative thesauri in the form of web ontologies offer a sound representation of domain knowledge and can serve as a reference point for automated semantic tagging. Current language models, in turn, capture the contextualized semantics of text corpora and can be leveraged toward the same goal. We present an approach for injecting subject annotations in the biomedical domain using query term expansion against such ontologies. To give the user an indication of how useful these suggestions are, we further propose an online method for validating annotation quality using natural language inference (NLI) models such as BART and XLM-R. To circumvent the training barriers posed by very large label sets and data scarcity, we rely on zero-shot classification and show that semantic matching can contribute above-average thematic annotations. A web-based validation service can also be attractive to human curators compared with the overhead of pretraining large, domain-tailored classification models.
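
The two steps described in the abstract (thesaurus-based suggestion, then zero-shot validation) can be sketched roughly as follows. This is an illustrative outline only, not the authors' implementation: the miniature thesaurus, the function names, the hypothesis template, and the 0.5 threshold are all assumptions, and the NLI model is replaced by a stub scorer with made-up values so that the example runs offline.

```python
from typing import Callable, Dict, List

# Hypothetical miniature thesaurus (preferred label -> synonyms); not real MeSH data.
TOY_THESAURUS: Dict[str, List[str]] = {
    "Neoplasms": ["cancer", "tumor"],
    "Hypertension": ["high blood pressure"],
    "Diabetes Mellitus": ["diabetes"],
}

# An entailment scorer maps (premise, hypothesis) to a score in [0, 1].
EntailmentScorer = Callable[[str, str], float]


def suggest_annotations(text: str, thesaurus: Dict[str, List[str]]) -> List[str]:
    """Suggest a concept when its label or one of its synonyms occurs in the text."""
    lowered = text.lower()
    suggestions = []
    for label, synonyms in thesaurus.items():
        terms = [label] + synonyms
        if any(term.lower() in lowered for term in terms):
            suggestions.append(label)
    return suggestions


def validate_annotations(
    text: str,
    labels: List[str],
    scorer: EntailmentScorer,
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
    template: str = "This text is about {}.",  # assumed hypothesis wording
) -> Dict[str, Dict[str, object]]:
    """Turn each label into an NLI hypothesis and flag those scoring above the threshold."""
    report: Dict[str, Dict[str, object]] = {}
    for label in labels:
        score = scorer(text, template.format(label))
        report[label] = {"score": score, "accepted": score >= threshold}
    return report


def stub_scorer(premise: str, hypothesis: str) -> float:
    """Stand-in for a real NLI model; the returned values are invented for the demo."""
    return 0.9 if "Neoplasms" in hypothesis else 0.2


abstract = "We study tumor growth in patients who also report high blood pressure."
suggested = suggest_annotations(abstract, TOY_THESAURUS)
report = validate_annotations(abstract, suggested, stub_scorer)
print(report)
```

With the Hugging Face transformers library installed, the stub could be swapped for a real model, for example `pipeline("zero-shot-classification", model="facebook/bart-large-mnli")` called with `candidate_labels=suggested` and `multi_label=True`; whether this matches the configuration used in the paper is not stated in the abstract.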


    Cited By

    • (2023) BioSift: A Dataset for Filtering Biomedical Abstracts for Drug Repurposing and Clinical Meta-Analysis. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2913-2923. DOI: 10.1145/3539618.3591897. Online publication date: 19-Jul-2023
    • (2023) A Multiple change-point detection framework on linguistic characteristics of real versus fake news articles. Scientific Reports 13:1. DOI: 10.1038/s41598-023-32952-3. Online publication date: 13-Apr-2023

              Published In

              CSBio2021: The 12th International Conference on Computational Systems-Biology and Bioinformatics
              October 2021
              97 pages
              ISBN: 9781450385107
              DOI: 10.1145/3486713
              Publisher

              Association for Computing Machinery

              New York, NY, United States

              Author Tags

              1. MeSH
              2. Thesaurus
              3. biomedical indexing
              4. classification
              5. language models
              6. machine learning
              7. semantic matching

              Qualifiers

              • Research-article
              • Research
              • Refereed limited

              Conference

              CSBio2021

              Acceptance Rates

              Overall Acceptance Rate 23 of 37 submissions, 62%
