DOI: 10.1145/3368555.3384455
research-article
Open access

Deidentification of free-text medical records using pre-trained bidirectional transformers

Published: 02 April 2020

Abstract

The ability of caregivers and investigators to share patient data is fundamental to many areas of clinical practice and biomedical research. Prior to sharing, it is often necessary to remove identifiers such as names, contact details, and dates in order to protect patient privacy. Deidentification, the process of removing identifiers, is challenging, however. High-quality annotated data for developing models is scarce; many target identifiers are highly heterogeneous (for example, there are countless variations of patient names); and in practice anything less than perfect sensitivity may be considered a failure. Consequently, software for adequately deidentifying clinical data is not widely available. As a result, patient data is often withheld when sharing would be beneficial, and identifiable patient data is often divulged when a deidentified version would suffice.
In recent years, advances in machine learning methods have led to rapid performance improvements in natural language processing tasks, in particular with the advent of large-scale pretrained language models. In this paper we develop and evaluate an approach for deidentification of clinical notes based on a bidirectional transformer model. We propose human-interpretable evaluation measures and demonstrate state-of-the-art performance against modern baseline models. Finally, we highlight current challenges in deidentification, including the absence of clear annotation guidelines, lack of portability of models, and paucity of training data. Code to develop our model is open source and simple to install, allowing for broad reuse.

    Published In

    CHIL '20: Proceedings of the ACM Conference on Health, Inference, and Learning
    April 2020
    265 pages
    ISBN: 9781450370462
    DOI: 10.1145/3368555
    This work is licensed under a Creative Commons Attribution-NoDerivs International 4.0 License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. HIPAA
    2. PHI
    3. deidentification
    4. electronic health records
    5. medical informatics
    6. named entity recognition
    7. natural language processing
    8. neural networks

    Qualifiers

    • Research-article

    Funding Sources

    • NIH

    Conference

    ACM CHIL '20
    Acceptance Rates

    Overall Acceptance Rate 27 of 110 submissions, 25%

    Cited By

    • (2025) A Transformer-Based Pipeline for German Clinical Document De-Identification. Applied Clinical Informatics 16:01 (031-043). DOI: 10.1055/a-2424-1989. Online publication date: 8-Jan-2025
    • (2025) Patient Privacy Information Retrieval with Longformer and CRF, Followed by Rule-Based Time Information Normalization: A Dual-Approach Study. Large Language Models for Automatic Deidentification of Electronic Health Record Notes (148-161). DOI: 10.1007/978-981-97-7966-6_11. Online publication date: 26-Jan-2025
    • (2024) Automated redaction of names in adverse event reports using transformer-based neural networks. BMC Medical Informatics and Decision Making 24:1. DOI: 10.1186/s12911-024-02785-9. Online publication date: 23-Dec-2024
    • (2024) Transformer models in biomedicine. BMC Medical Informatics and Decision Making 24:1. DOI: 10.1186/s12911-024-02600-5. Online publication date: 29-Jul-2024
    • (2024) Silencing the Risk, Not the Whistle: A Semi-automated Text Sanitization Tool for Mitigating the Risk of Whistleblower Re-Identification. Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency (733-745). DOI: 10.1145/3630106.3658936. Online publication date: 3-Jun-2024
    • (2024) De-identification of clinical free text using natural language processing. Artificial Intelligence in Medicine 151:C. DOI: 10.1016/j.artmed.2024.102845. Online publication date: 1-May-2024
    • (2024) Evaluating the disclosure risk of anonymized documents via a machine learning-based re-identification attack. Data Mining and Knowledge Discovery 38:6 (4040-4075). DOI: 10.1007/s10618-024-01066-3. Online publication date: 3-Sep-2024
    • (2024) Robot Control via Natural Instructions Empowered by Large Language Model. Discovering the Frontiers of Human-Robot Interaction (437-457). DOI: 10.1007/978-3-031-66656-8_19. Online publication date: 24-Jul-2024
    • (2023) Linguistic and ontological challenges of multiple domains contributing to transformed health ecosystems. Frontiers in Medicine 10. DOI: 10.3389/fmed.2023.1073313. Online publication date: 15-Mar-2023
    • (2023) Web-Based Application Based on Human-in-the-Loop Deep Learning for Deidentifying Free-Text Data in Electronic Medical Records: Development and Usability Study. Interactive Journal of Medical Research 12 (e46322). DOI: 10.2196/46322. Online publication date: 25-Aug-2023
