research-article

Open access

Deidentification of free-text medical records using pre-trained bidirectional transformers

Authors:

Alistair E. W. Johnson,

Lucas Bulgarelli,

Tom J. Pollard

CHIL '20: Proceedings of the ACM Conference on Health, Inference, and Learning

Pages 214 - 221

https://doi.org/10.1145/3368555.3384455

Published: 02 April 2020

Abstract

The ability of caregivers and investigators to share patient data is fundamental to many areas of clinical practice and biomedical research. Prior to sharing, it is often necessary to remove identifiers such as names, contact details, and dates in order to protect patient privacy. Deidentification, the process of removing identifiers, is challenging, however. High-quality annotated data for developing models is scarce; many target identifiers are highly heterogeneous (for example, there are uncountable variations of patient names); and in practice anything less than perfect sensitivity may be considered a failure. Consequently, software for adequately deidentifying clinical data is not widely available. As a result, patient data is often withheld when sharing would be beneficial, and identifiable patient data is often divulged when a deidentified version would suffice.

In recent years, advances in machine learning methods have led to rapid performance improvements in natural language processing tasks, in particular with the advent of large-scale pretrained language models. In this paper we develop and evaluate an approach for deidentification of clinical notes based on a bidirectional transformer model. We propose human-interpretable evaluation measures and demonstrate state-of-the-art performance against modern baseline models. Finally, we highlight current challenges in deidentification, including the absence of clear annotation guidelines, lack of portability of models, and paucity of training data. Code to develop our model is open source and simple to install, allowing for broad reuse.
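To make the task concrete, the following is a minimal sketch, not the authors' released code, of the two steps the abstract alludes to: replacing predicted identifier spans with placeholders, and measuring token-level sensitivity (the fraction of true identifier tokens that were flagged). The function names `redact` and `sensitivity`, the span format `(start, stop, label)`, and the example note are all assumptions made for illustration; in a real system the spans would come from a trained model rather than being hard-coded.

```python
from typing import List, Tuple


def redact(text: str, spans: List[Tuple[int, int, str]]) -> str:
    """Replace each (start, stop, label) character span with a [LABEL] placeholder."""
    pieces = []
    cursor = 0
    for start, stop, label in sorted(spans):
        if start < cursor:
            # Skip a span that overlaps one already redacted.
            continue
        pieces.append(text[cursor:start])
        pieces.append(f"[{label}]")
        cursor = stop
    pieces.append(text[cursor:])
    return "".join(pieces)


def sensitivity(gold: List[int], pred: List[int]) -> float:
    """Token-level sensitivity: flagged identifier tokens / all identifier tokens."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    # With no identifier tokens present there is nothing to miss.
    return tp / (tp + fn) if (tp + fn) else 1.0


# Hypothetical note and hand-written spans standing in for model output.
note = "Seen by Dr. Smith on 2020-04-02."
name_start = note.index("Smith")
date_start = note.index("2020-04-02")
predicted_spans = [
    (name_start, name_start + len("Smith"), "NAME"),
    (date_start, date_start + len("2020-04-02"), "DATE"),
]
redacted_note = redact(note, predicted_spans)

# 1 marks an identifier token, 0 marks any other token.
gold_labels = [0, 0, 1, 1, 0, 1]
pred_labels = [0, 0, 1, 0, 0, 1]
score = sensitivity(gold_labels, pred_labels)
```

In this toy case one of three identifier tokens is missed, so sensitivity is two thirds; under the standard the abstract describes, a single missed identifier like this may already count as a failure.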

Index Terms

Deidentification of free-text medical records using pre-trained bidirectional transformers
1. Applied computing
  1. Document management and text processing
    1. Document preparation
      1. Annotation

Published In

CHIL '20: Proceedings of the ACM Conference on Health, Inference, and Learning

April 2020

265 pages

ISBN:9781450370462

DOI:10.1145/3368555

General Chair:
Marzyeh Ghassemi
University of Toronto and the Vector Institute

Copyright © 2020 Owner/Author.

This work is licensed under a Creative Commons Attribution-NoDerivs International 4.0 License.

Sponsors

ACM: Association for Computing Machinery

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

NIH

Conference

ACM CHIL '20

Sponsor:

ACM

ACM CHIL '20: ACM Conference on Health, Inference, and Learning

April 2 - 4, 2020

Toronto, Ontario, Canada

Acceptance Rates

Overall Acceptance Rate 27 of 110 submissions, 25%
