Abstract
Abbreviations are unavoidable yet critical parts of medical text. Using abbreviations, especially in clinical patient notes, can save time and space, protect sensitive information, and help avoid repetition. However, most abbreviations have multiple possible senses, and the lack of a standardized mapping system makes disambiguating them a difficult and time-consuming task. The main objective of this study is to examine the feasibility of sequence labeling methods for medical abbreviation disambiguation. Specifically, we explore the capability of sequence labeling methods to handle multiple unique abbreviations in a single text. We use two public datasets to compare and contrast the performance of several transformer models pre-trained on different scientific and medical corpora. Our proposed sequence labeling approach outperforms the more commonly used text classification models on the abbreviation disambiguation task. In particular, the SciBERT model shows strong performance on both the sequence labeling and text classification tasks over the two considered datasets. Furthermore, we find that the abbreviation disambiguation performance of the text classification models becomes comparable to that of sequence labeling only when postprocessing is applied to their predictions, which involves filtering the possible labels for an abbreviation based on the training data.
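To make the sequence labeling formulation concrete, the following is a minimal illustrative sketch (hypothetical tokens and senses, not the authors' exact pipeline): each token receives either a neutral tag or the expanded sense of the abbreviation at that position, so a single pass over the text can resolve several distinct abbreviations at once.

```python
# Illustrative sketch of the sequence labeling formulation for
# abbreviation disambiguation; tokens, tags, and senses are made up.
tokens = ["Pt", "with", "hx", "of", "MI", "denies", "CP", "."]

# One tag per token: "O" for ordinary words, otherwise the predicted
# expansion of the abbreviation at that position. Unlike sentence-level
# text classification, this resolves all abbreviations in one pass.
tags = [
    "patient",               # Pt
    "O",
    "history",               # hx
    "O",
    "myocardial infarction", # MI (vs. e.g. "mitral insufficiency")
    "O",
    "chest pain",            # CP (vs. e.g. "cerebral palsy")
    "O",
]

for token, tag in zip(tokens, tags):
    print(f"{token:>8} -> {tag}")
```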
Data Availability
All the datasets used in our analysis are publicly available at the following links: MeDAL, https://www.kaggle.com/datasets/xhlulu/medal-emnlp; UMN, https://conservancy.umn.edu/handle/11299/137703
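For convenience, the snippet below is a minimal sketch of loading MeDAL with pandas; the file path and the TEXT/LOCATION/LABEL column names are assumptions based on the public Kaggle release and may differ across versions.

```python
import pandas as pd

# Minimal sketch for loading MeDAL (assumed file path and column names;
# the public release ships CSVs with TEXT, LOCATION, and LABEL fields).
df = pd.read_csv("medal/train.csv")

# LOCATION is the token index of the abbreviation to disambiguate,
# and LABEL is its expanded sense in that context.
row = df.iloc[0]
abbreviation = row["TEXT"].split()[row["LOCATION"]]
print(abbreviation, "->", row["LABEL"])
```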
References
Navigli R (2009) Word sense disambiguation: a survey. ACM Comput Surv (CSUR) 41:1–69
Agirre E, Edmonds P (2007) Word sense disambiguation: algorithms and applications, vol. 33. Springer Science & Business Media
Abbreviation Definition & Meaning (2022). Merriam-Webster. https://www.merriam-webster.com/dictionary/abbreviation?utm_campaign=sd&utm_medium=serp&utm_source=jsonld#note-2
Jaber A, Martínez P (2022) Disambiguating clinical abbreviations using a one-fits-all classifier based on deep learning techniques. Methods Inf Med
Grossman LV, Mitchell EG, Hripcsak G, Weng C, Vawdrey DK (2018) A method for harmonization of clinical abbreviation and acronym sense inventories. J Biomed Inform 88:62–69
McInnes B, Pedersen T, Liu Y, Pakhomov S, Melton GB (2011) Using second-order vectors in a knowledge-based method for acronym disambiguation. In: Proceedings of the Fifteenth Conference on Computational Natural Language Learning, pp 145–153
Xu H, Stetson PD, Friedman C (2012) Combining corpus-derived sense profiles with estimated frequency information to disambiguate clinical abbreviations. In: AMIA Annual Symposium Proceedings, vol. 2012. American Medical Informatics Association, p 1004
Devlin J, Chang M-W, Lee K, Toutanova K (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint. arXiv:1810.04805
Pakhomov S, Pedersen T, Chute CG (2005) Abbreviation and acronym disambiguation in clinical discourse. In: AMIA Annual Symposium Proceedings, vol. 2005. American Medical Informatics Association, p 589
Joshi M, Pakhomov S, Pedersen T, Chute CG (2006) A comparative study of supervised learning as applied to acronym expansion in clinical reports. In: AMIA Annual Symposium Proceedings, vol. 2006. American Medical Informatics Association, p 399
Moon S, Pakhomov S, Melton GB (2012) Automated disambiguation of acronyms and abbreviations in clinical texts: window and training size considerations. In: AMIA Annual Symposium Proceedings, vol. 2012. American Medical Informatics Association, p 1310
Jaber A, Martínez P (2021) Disambiguating clinical abbreviations using pre-trained word embeddings. In: HEALTHINF, pp 501–508
Lee J, Yoon W, Kim S, Kim D, Kim S, So CH, Kang J (2020) BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36:1234–1240
McCallum A (2012) Efficiently inducing features of conditional random fields. arXiv preprint. arXiv:1212.2504
Quinlan JR (2004) Data mining tools See5 and C5.0. http://www.rulequest.com/see5-info.html
Wu Y, Xu J, Zhang Y, Xu H (2015) Clinical abbreviation disambiguation using neural word embeddings. In: Proceedings of BioNLP 15, pp 171–176
Li I, Yasunaga M, Nuzumlalı MY, Caraballo C, Mahajan S, Krumholz H, Radev D (2019) A neural topic-attention model for medical term abbreviation disambiguation. arXiv preprint. arXiv:1910.14076
Johnson AE, Pollard TJ, Shen L, Lehman L-WH, Feng M, Ghassemi M, Moody B, Szolovits P, Celi LA, Mark RG (2016) MIMIC-III, a freely accessible critical care database. Sci Data 3:1–9
Jin Q, Liu J, Lu X (2019) Deep contextualized biomedical abbreviation expansion. arXiv preprint. arXiv:1906.03360
Doğan RI, Leaman R, Lu Z (2014) NCBI disease corpus: a resource for disease name recognition and concept normalization. J Biomed Inform 47:1–10
Bravo À, Piñero J, Queralt-Rosinach N, Rautschka M, Furlong LI (2015) Extraction of relations between genes and diseases from text and large-scale data analysis: implications for translational research. BMC Bioinform 16:1–17
Tsatsaronis G, Balikas G, Malakasiotis P, Partalas I, Zschunke M, Alvers MR, Weissenborn D, Krithara A, Petridis S, Polychronopoulos D et al (2015) An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition. BMC Bioinform 16:1–28
Wen Z, Lu XH, Reddy S (2020) MeDAL: medical abbreviation disambiguation dataset for natural language understanding pretraining. In: Proceedings of the 3rd Clinical Natural Language Processing Workshop, Association for Computational Linguistics, Online, pp 130–135. https://aclanthology.org/2020.clinicalnlp-1.15, https://doi.org/10.18653/v1/2020.clinicalnlp-1.15
Peters M, Neumann M, Iyyer M, Gardner M, Clark C, Lee K, Zettlemoyer L (2018) Deep contextualized word representations. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol. 1 (Long Papers). https://doi.org/10.18653/v1/n18-1202
Jin Q, Dhingra B, Cohen WW, Lu X (2019) Probing biomedical embeddings from language models. arXiv preprint. arXiv:1904.02181
Hanisch D, Fundel K, Mevissen H-T, Zimmer R, Fluck J (2005) ProMiner: rule-based protein and gene entity recognition. BMC Bioinform 6:1–9
Quimbaya AP, Múnera AS, Rivera RAG, Rodríguez JCD, Velandia OMM, Peña AAG, Labbé C (2016) Named entity recognition over electronic health records through a combined dictionary-based approach. Proc Comput Sci 100:55–61
Zhang S, Elhadad N (2013) Unsupervised biomedical named entity recognition: experiments with clinical and biological texts. J Biomed Inform 46:1088–1098
Settles B (2004) Biomedical named entity recognition using conditional random fields and rich feature sets. In: Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (NLPBA/BioNLP), pp 107–110
Yao L, Liu H, Liu Y, Li X, Anwar MW (2015) Biomedical named entity recognition based on deep neutral network. Int J Hybrid Inf Technol 8:279–288
Souza F, Nogueira R, Lotufo R (2019) Portuguese named entity recognition using BERT-CRF. arXiv preprint. arXiv:1909.10649
Liu W, Zhou P, Zhao Z, Wang Z, Ju Q, Deng H, Wang P (2020) K-BERT: enabling language representation with knowledge graph. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp 2901–2908
Lafferty J, McCallum A, Pereira FC (2001) Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of the Eighteenth International Conference on Machine Learning (ICML), pp 282–289
Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9:1735–1780
Jozefowicz R, Zaremba W, Sutskever I (2015) An empirical exploration of recurrent network architectures. In: International Conference on Machine Learning. PMLR, pp 2342–2350
Huang Z, Xu W, Yu K (2015) Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint. arXiv:1508.01991
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. Adv Neural Inf Process Syst 30
Sanh V, Debut L, Chaumond J, Wolf T (2019) DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint. arXiv:1910.01108
Peng Y, Yan S, Lu Z (2019) Transfer learning in biomedical natural language processing: an evaluation of BERT and ELMo on ten benchmarking datasets. In: Proceedings of the 2019 Workshop on Biomedical Natural Language Processing (BioNLP 2019), pp 58–65
MS-BERT (2020). https://huggingface.co/NLP4H/ms_bert
Beltagy I, Lo K, Cohan A (2019) SciBERT: a pretrained language model for scientific text. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, pp 3615–3620. https://aclanthology.org/D19-1371, https://doi.org/10.18653/v1/D19-1371
Moon S, Pakhomov S, Melton G (2012) Clinical abbreviation sense inventory. https://conservancy.umn.edu/handle/11299/137703
Author information
Contributions
All the co-authors contributed to the conception, design, implementation, writing, and review of the paper. Author order is alphabetical.
Ethics declarations
Ethical Approval
Not applicable
Competing Interests
The authors declare no competing interests.
Appendix. Text classification postprocessing results
In this appendix, we report the detailed results for the text classification experiments in Sect. 4.2. Table 9 presents the performance values before and after applying postprocessing. Overall, we observe that almost all the models benefit from postprocessing. In particular, DistilBERT, BlueBERT, and MS-BERT show significant performance improvements on the MeDAL dataset. On the other hand, the BioBERT and SciBERT models do not benefit from the postprocessing approach on the UMN dataset.
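For reference, the postprocessing step can be summarized by the following minimal sketch (an illustration of the idea rather than the exact implementation): a classifier's per-label scores are masked so that only senses observed for the given abbreviation in the training data remain eligible.

```python
import numpy as np

def postprocess(scores, abbreviation, sense_inventory, labels):
    """Restrict predictions to senses seen for this abbreviation in training.

    scores: per-label scores from the text classifier (1-D array)
    sense_inventory: dict mapping abbreviation -> set of admissible senses
    labels: label string for each index of `scores`
    """
    admissible = sense_inventory[abbreviation]
    masked = np.where(
        [label in admissible for label in labels], scores, -np.inf
    )
    return labels[int(np.argmax(masked))]

# Hypothetical example: "MS" has only two senses in the training data,
# so a spuriously high score for an inadmissible sense is filtered out.
labels = np.array(["multiple sclerosis", "mitral stenosis", "mental status"])
inventory = {"MS": {"multiple sclerosis", "mitral stenosis"}}
print(postprocess(np.array([0.2, 0.3, 0.5]), "MS", inventory, labels))
# -> "mitral stenosis" (the raw argmax, "mental status", was filtered out)
```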
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Cevik, M., Mohammad Jafari, S., Myers, M. et al. Sequence Labeling for Disambiguating Medical Abbreviations. J Healthc Inform Res 7, 501–526 (2023). https://doi.org/10.1007/s41666-023-00146-1