Patient Privacy Information Retrieval with Longformer and CRF, Followed by Rule-Based Time Information Normalization: A Dual-Approach Study

Tseng, Fan-Pin; Ko, Han-Chun; Hou, Xiu-Yu; Wijaya, Danang; Chang, Connyn Kang-Lin; Tsai, Richard Tzong-Han

doi:10.1007/978-981-97-7966-6_11

Part of the book series: Communications in Computer and Information Science (CCIS, volume 2148)

Included in the following conference series:

International Workshop on Deidentification of Electronic Medical Record Notes

Abstract

This study integrates the Longformer model with Conditional Random Fields (CRF) to improve Named Entity Recognition (NER) in healthcare data processing. It focuses on two tasks, patient privacy information retrieval and time information normalization, using the ‘Artificial Intelligence CUP 2023: Privacy Protection and Medical Data Standardization Challenge Dataset’. The research was conducted within the AI CUP 2023 competition, which is dedicated to the privacy protection and standardization of medical data. We used the Longformer model, which is designed to handle long text sequences, and combined it with CRF to improve entity recognition accuracy in Electronic Health Record (EHR) text notes. To address lengthy texts and class distribution imbalances, we developed a dedicated process for handling long documents. For training, texts were segmented into chunks of 4,096 characters, which allowed more focused and efficient training. For prediction, we applied a sliding window technique so that the text segments could be analyzed and their results integrated seamlessly. This strategy was crucial for accurately retrieving patient privacy information from lengthy healthcare records. We also applied rule-based methods to time information normalization, which extends the applicability of the approach in the medical data domain. The combination of Longformer and CRF proved effective in identifying sensitive patient information and normalizing time-related data. The approach illustrates the synergy between deep learning models and traditional methods and offers a robust framework for improving the security and efficiency of healthcare data processing.
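The chunking and sliding-window prediction described in the abstract can be illustrated with a minimal Python sketch. Only the 4,096-character chunk size comes from the abstract. The window step (`STRIDE`), the de-duplication used to merge overlapping windows, and the names `sliding_windows`, `predict_document`, and `toy_tagger` are all assumptions made for illustration, and the regex tagger merely stands in for the Longformer-CRF model. This is not the authors' implementation.

```python
import re
from typing import Callable, Iterator, List, Tuple

Span = Tuple[int, int, str]  # (start, end, label)

CHUNK_SIZE = 4096  # characters per chunk, as stated in the abstract
STRIDE = 2048      # ASSUMPTION: window step; the abstract does not give one


def sliding_windows(text: str, size: int = CHUNK_SIZE,
                    stride: int = STRIDE) -> Iterator[Tuple[int, str]]:
    """Yield (start_offset, chunk) pairs that together cover the whole text."""
    if size <= 0 or stride <= 0 or stride > size:
        raise ValueError("need 0 < stride <= size")
    start = 0
    while True:
        yield start, text[start:start + size]
        if start + size >= len(text):
            break  # this window already reaches the end of the text
        start += stride


def predict_document(text: str,
                     tag_chunk: Callable[[str], List[Span]]) -> List[Span]:
    """Tag every window, shift spans to document offsets, drop duplicates.

    ASSUMPTION: overlapping windows are merged by exact-span de-duplication;
    the abstract does not say how window outputs are combined.
    """
    found = set()
    for offset, chunk in sliding_windows(text):
        for start, end, label in tag_chunk(chunk):
            found.add((start + offset, end + offset, label))
    return sorted(found)


def toy_tagger(chunk: str) -> List[Span]:
    """Hypothetical stand-in for the model: labels any 4-digit run as DATE."""
    return [(m.start(), m.end(), "DATE") for m in re.finditer(r"\d{4}", chunk)]
```

Calling `predict_document(note_text, toy_tagger)` returns spans in document coordinates. Because consecutive windows overlap, an entity cut in half at the edge of one window can still be seen whole in the next, which is the usual motivation for sliding-window prediction.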

Acknowledgments

We thank the AI Cup 2023 organizers, particularly the Taiwan Ministry of Education, the Intelligent Systems Laboratory at National Kaohsiung University of Science and Technology, the Department of Bioinformatics and Medical Engineering at Asia University, and the SREDH Consortium, for their support and for providing the dataset crucial for our healthcare AI research.

Author information

Authors and Affiliations

National Central University, No. 300, Zhongda Road, Zhongli District, Taoyuan, 320, Taiwan (R.O.C.)
Fan-Pin Tseng, Han-Chun Ko, Xiu-Yu Hou, Danang Wijaya, Connyn Kang-Lin Chang & Richard Tzong-Han Tsai
National Atomic Research Institute, No. 1000, Wenhua Road, Jiaan Village, Longtan District, Taoyuan, 32546, Taiwan (R.O.C.)
Fan-Pin Tseng

Corresponding author

Correspondence to Fan-Pin Tseng.

Editor information

Editors and Affiliations

University of New South Wales, Sydney, NSW, Australia
Jitendra Jonnagaddala
National Kaohsiung University of Science and Technology, Kaohsiung, Taiwan
Hong-Jie Dai
Asia University, Taichung, Taiwan
Ching-Tai Chen

Ethics declarations

The authors declare that they have no competing interests relevant to the content of this article to disclose.

About this paper

Cite this paper

Tseng, F.-P., Ko, H.-C., Hou, X.-Y., Wijaya, D., Chang, C.K.-L., Tsai, R.T.-H. (2025). Patient Privacy Information Retrieval with Longformer and CRF, Followed by Rule-Based Time Information Normalization: A Dual-Approach Study. In: Jonnagaddala, J., Dai, H.-J., Chen, C.-T. (eds) Large Language Models for Automatic Deidentification of Electronic Health Record Notes. IW-DMRN 2024. Communications in Computer and Information Science, vol 2148. Springer, Singapore. https://doi.org/10.1007/978-981-97-7966-6_11

DOI: https://doi.org/10.1007/978-981-97-7966-6_11
Published: 26 January 2025
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-7965-9
Online ISBN: 978-981-97-7966-6
eBook Packages: Computer Science, Computer Science (R0)