Abstract
This study integrates the Longformer model with Conditional Random Fields (CRF) to improve Named Entity Recognition (NER) in healthcare data processing. It focuses on two tasks, patient privacy information retrieval and time information normalization, using the ‘Artificial Intelligence CUP 2023: Privacy Protection and Medical Data Standardization Challenge Dataset’. The research was conducted as part of the AI CUP 2023 competition, which is dedicated to the privacy protection and standardization of medical data. We chose Longformer for its effectiveness on long text sequences and combined it with CRF to improve entity recognition accuracy in Electronic Health Record (EHR) text notes. To address lengthy texts and class imbalance, we developed a dedicated process for handling large-scale textual data. For training, long texts are segmented into chunks of 4,096 characters, which keeps training focused and efficient. For prediction, a sliding window technique is applied so that the text segments are analyzed and integrated consistently. This strategy was crucial for accurately retrieving patient privacy information from lengthy healthcare records. Time information is normalized with rule-based methods, which further extends the applicability of the approach in the medical data domain. The combined approach proved effective at identifying sensitive patient information and normalizing time-related data. It illustrates how deep learning models and traditional methods can complement each other in a robust framework for secure and efficient healthcare data processing.
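The chunking and sliding-window prediction described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract gives only the 4,096-character chunk size, so the stride of 2,048 characters, the de-duplication of entities found in overlapping windows, and the names `sliding_windows`, `predict_document`, and `toy_tagger` are all hypothetical. The toy tagger stands in for the Longformer and CRF model.

```python
import re
from typing import Callable, List, Tuple

Span = Tuple[int, int, str]  # (start, end, label) in character offsets


def sliding_windows(length: int, size: int = 4096,
                    stride: int = 2048) -> List[Tuple[int, int]]:
    """Return (start, end) character offsets that cover [0, length).

    size=4096 follows the abstract; stride=2048 is an assumed overlap.
    """
    if size <= 0 or not 0 < stride <= size:
        raise ValueError("need size > 0 and 0 < stride <= size")
    windows = []
    start = 0
    while True:
        end = min(start + size, length)
        windows.append((start, end))
        if end >= length:
            break
        start += stride
    return windows


def predict_document(text: str,
                     tag_window: Callable[[str], List[Span]],
                     size: int = 4096,
                     stride: int = 2048) -> List[Span]:
    """Tag each window, shift spans to document offsets, drop duplicates."""
    found = set()
    for start, end in sliding_windows(len(text), size, stride):
        for s, e, label in tag_window(text[start:end]):
            # Window-local offsets become document offsets here, so the
            # same entity seen in two overlapping windows collapses to one.
            found.add((start + s, start + e, label))
    return sorted(found)


def toy_tagger(chunk: str) -> List[Span]:
    """Hypothetical stand-in for the model: tags the literal 'ID123'."""
    return [(m.start(), m.end(), "IDNUM") for m in re.finditer("ID123", chunk)]


if __name__ == "__main__":
    note = "x" * 3000 + "ID123" + "y" * 5000
    print(predict_document(note, toy_tagger))
```

With overlapping windows, an entity that falls inside the overlap is predicted twice and merged once offsets are shifted back to document coordinates. An entity longer than the overlap could still be split at a window edge, so the stride is a trade-off between cost and coverage.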
Acknowledgments
We thank the AI Cup 2023 organizers, particularly the Taiwan Ministry of Education, the Intelligent Systems Laboratory at National Kaohsiung University of Science and Technology, the Department of Bioinformatics and Medical Engineering at Asia University, and the SREDH Consortium, for their support and for providing the dataset crucial for our healthcare AI research.
Ethics declarations
The authors have no competing interests to declare that are relevant to the content of this article.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Tseng, FP., Ko, HC., Hou, XY., Wijaya, D., Chang, C.KL., Tsai, R.TH. (2025). Patient Privacy Information Retrieval with Longformer and CRF, Followed by Rule-Based Time Information Normalization: A Dual-Approach Study. In: Jonnagaddala, J., Dai, HJ., Chen, CT. (eds) Large Language Models for Automatic Deidentification of Electronic Health Record Notes. IW-DMRN 2024. Communications in Computer and Information Science, vol 2148. Springer, Singapore. https://doi.org/10.1007/978-981-97-7966-6_11
DOI: https://doi.org/10.1007/978-981-97-7966-6_11
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-7965-9
Online ISBN: 978-981-97-7966-6
eBook Packages: Computer Science, Computer Science (R0)