Differentially Private Medical Texts Generation Using Generative Neural Networks

Authors: Md Momin Al Aziz, Xiaoqian Jiang, Noman Mohammed

ACM Transactions on Computing for Healthcare (HEALTH), Volume 3, Issue 1

Article No.: 5, Pages 1 - 27

https://doi.org/10.1145/3469035

Published: 15 October 2021

Abstract

Advances in data science have given us affordable storage and efficient algorithms for querying large volumes of data. Health records are a significant part of this data; they are pivotal for healthcare providers and can be used for our well-being. Clinical notes in electronic health records are one such category: they capture a patient’s complete medical information at different timesteps of care in the form of free text. These unstructured notes therefore record events from a patient’s admission to discharge, which can prove significant for future medical decisions. However, because the texts also contain sensitive information about the patient and the attending medical professionals, they cannot be shared publicly. This privacy issue has thwarted timely discoveries from a wealth of untapped information. In this work, we therefore generate synthetic medical texts from a private or sanitized (de-identified) clinical text corpus and rigorously analyze their utility across different metrics and levels. Experimental results support the applicability of the generated data: it achieves more than 80% accuracy on different practical classification problems and matches (or outperforms) the original text data.

Cited By

Keating M, Bollard S, Potter S. 2024. Assessing the Quality, Readability, and Acceptability of AI-Generated Information in Plastic and Aesthetic Surgery. Cureus. Online publication date: 17-Nov-2024. https://doi.org/10.7759/cureus.73874

Sarkar A, Chuang Y, Mohammed N, Jiang X. 2024. De-identification is not enough: a comparison between de-identified and synthetic clinical notes. Scientific Reports 14:1. Online publication date: 29-Nov-2024. https://doi.org/10.1038/s41598-024-81170-y

Yang Z, Li Y, Zhou G. 2023. TS-GAN: Time-series GAN for Sensor-based Health Data Augmentation. ACM Transactions on Computing for Healthcare 4:2 (1-21). Online publication date: 18-Apr-2023. https://dl.acm.org/doi/10.1145/3583593

Akbari F, Sartipi K, Archer N. 2023. Synthetic Behavior Sequence Generation Using Generative Adversarial Networks. ACM Transactions on Computing for Healthcare 4:1 (1-23). Online publication date: 27-Feb-2023. https://dl.acm.org/doi/10.1145/3563950

Published In

ACM Transactions on Computing for Healthcare Volume 3, Issue 1

January 2022

255 pages

EISSN: 2637-8051

DOI: 10.1145/3485154

Editors: Insup Lee (University of Pennsylvania, USA), John A. Stankovic (University of Virginia, USA)

Copyright © 2021 Association for Computing Machinery.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 15 October 2021

Accepted: 01 May 2021

Revised: 01 May 2021

Received: 01 July 2020

Published in HEALTH Volume 3, Issue 1

Qualifiers

Research-article
Refereed

Funding Sources

CPRIT Scholar in Cancer Research
Christopher Sarofim Family Professorship
UT Stars
UTHealth
National Institutes of Health (NIH)
NSERC Discovery
