
Patient Privacy Information Retrieval with Longformer and CRF, Followed by Rule-Based Time Information Normalization: A Dual-Approach Study

  • Conference paper
Large Language Models for Automatic Deidentification of Electronic Health Record Notes (IW-DMRN 2024)

Abstract

This study explores integrating the Longformer model with Conditional Random Fields (CRF) to improve Named Entity Recognition (NER) in healthcare data processing. It focuses on two tasks, patient privacy information retrieval and time information normalization, using the ‘Artificial Intelligence CUP 2023: Privacy Protection and Medical Data Standardization Challenge Dataset’. The work was carried out for the AI CUP 2023 competition, which is dedicated to privacy protection and standardization of medical data. We chose Longformer because it is designed to handle long text sequences, and combined it with a CRF to improve entity recognition accuracy in Electronic Health Record (EHR) text notes. To cope with lengthy texts and class imbalance, we developed a dedicated process for handling long documents. For training, each text was segmented into chunks of 4,096 characters, which allowed more focused and efficient training. For prediction, we applied a sliding window so that the results from these segments could be integrated seamlessly. This strategy was crucial for accurately retrieving patient privacy information from lengthy healthcare records. In addition, we used rule-based methods for time information normalization. Together, the Longformer-CRF model and the rule-based normalizer proved effective at identifying sensitive patient information and normalizing time-related data. The approach illustrates how deep learning models and traditional rule-based methods can complement each other in a framework for more secure and efficient healthcare data processing.
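The two processing steps named in the abstract, sliding-window prediction over 4,096-character segments and rule-based time normalization, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The stride of 2,048 characters, the `predict_chunk` callable (a stand-in for the Longformer-CRF tagger), the day/month/year input pattern, and the ISO 8601 output format are all assumptions made for illustration; the abstract specifies none of them.

```python
import re
from datetime import date


def sliding_windows(text, size=4096, stride=2048):
    """Yield (offset, chunk) pairs that cover `text` with overlapping windows.

    `size` follows the 4,096-character chunks in the abstract; `stride`
    is an assumed value, not taken from the paper.
    """
    if size <= 0 or not 0 < stride <= size:
        raise ValueError("require size > 0 and 0 < stride <= size")
    start = 0
    while True:
        yield start, text[start:start + size]
        if start + size >= len(text):
            break
        start += stride


def predict_document(text, predict_chunk, size=4096, stride=2048):
    """Run a chunk-level tagger over every window and merge the results.

    `predict_chunk` is a hypothetical callable returning (start, end, label)
    spans relative to the chunk. Spans are shifted to document offsets, and
    duplicates found in overlapping windows are collapsed.
    """
    found = set()
    for offset, chunk in sliding_windows(text, size, stride):
        for start, end, label in predict_chunk(chunk):
            found.add((offset + start, offset + end, label))
    return sorted(found)


# Assumed rule: dates written as day/month/year with "/" or "." separators.
DATE_RE = re.compile(r"\b(\d{1,2})[/.](\d{1,2})[/.](\d{4})\b")


def normalize_date(text):
    """Return the first date in `text` as YYYY-MM-DD, or None if none is valid."""
    match = DATE_RE.search(text)
    if match is None:
        return None
    day, month, year = map(int, match.groups())
    try:
        return date(year, month, day).isoformat()
    except ValueError:  # e.g. 31/02/2015 is not a real calendar date
        return None
```

An entity that straddles a window edge is cut off in that window but appears whole in a neighbouring one as long as the overlap (`size - stride`) is at least as long as the entity, which is the reason for using overlapping windows in this sketch.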

Acknowledgments

We thank the AI Cup 2023 organizers, particularly the Taiwan Ministry of Education, the Intelligent Systems Laboratory at National Kaohsiung University of Science and Technology, the Department of Bioinformatics and Medical Engineering at Asia University, and the SREDH Consortium, for their support and for providing the dataset crucial for our healthcare AI research.

Author information

Corresponding author

Correspondence to Fan-Pin Tseng.

Ethics declarations

The authors declare that they have no competing interests relevant to the content of this article to disclose.

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Tseng, FP., Ko, HC., Hou, XY., Wijaya, D., Chang, C.KL., Tsai, R.TH. (2025). Patient Privacy Information Retrieval with Longformer and CRF, Followed by Rule-Based Time Information Normalization: A Dual-Approach Study. In: Jonnagaddala, J., Dai, HJ., Chen, CT. (eds) Large Language Models for Automatic Deidentification of Electronic Health Record Notes. IW-DMRN 2024. Communications in Computer and Information Science, vol 2148. Springer, Singapore. https://doi.org/10.1007/978-981-97-7966-6_11

  • DOI: https://doi.org/10.1007/978-981-97-7966-6_11

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-97-7965-9

  • Online ISBN: 978-981-97-7966-6

  • eBook Packages: Computer Science, Computer Science (R0)