DOI: 10.1145/3580305.3599427

MedLink: De-Identified Patient Health Record Linkage

Published: 04 August 2023

    Abstract

    A comprehensive patient health history is essential for patient care and healthcare research. However, due to the distributed nature of healthcare services, patient health records are often scattered across multiple systems. Existing record linkage approaches primarily rely on patient identifiers, which have inherent limitations such as privacy invasion and identifier discrepancies. To tackle this problem, we propose linking de-identified patient health records by matching health patterns without strictly relying on sensitive patient identifiers. Our model MedLink solves two challenges faced with the patient linkage task: (1) the challenge of identifying the same patients based on data collected in different timelines as disease progression makes the record matching difficult, and (2) the challenge of identifying distinct health patterns as common medical codes dominate health records and overshadow the more informative low-prevalence codes. To address these challenges, MedLink utilizes bi-directional health prediction to predict future codes forwardly and past codes backwardly, thus accounting for the health progression. MedLink also has a prevalence-aware retrieval design to focus more on the low-prevalence but informative codes during learning. MedLink can be trained end-to-end and is lightweight for efficient inference on large patient databases. We evaluate MedLink against leading baselines on real-world patient datasets, including the critical care dataset MIMIC-III and a large health claims dataset. Results show that MedLink outperforms the best baseline by 4% in top-1 accuracy with only 8% memory cost. Additionally, when combined with existing identifier-based linkage approaches, MedLink can improve their performance by up to 15%.
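    The prevalence-aware idea in the abstract (giving rare but informative codes more influence than common ones) can be illustrated with a simple IDF-style weighting. This is only a minimal sketch under stated assumptions, not MedLink's learned retrieval model: the function names, the `log(N / df)` weighting, and the toy code labels (`HTN`, `T2DM`, `CKD4`, `COPD`) are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def idf_weights(records):
    """Weight each code by log(N / document frequency), so rare codes score higher.

    Assumption: a plain IDF weighting stands in for the learned,
    prevalence-aware weighting described in the abstract.
    """
    n = len(records)
    df = Counter(code for codes in records for code in set(codes))
    return {code: math.log(n / count) for code, count in df.items()}


def score(query, candidate, weights):
    """Sum the weights of the codes shared by the two records."""
    return sum(weights.get(code, 0.0) for code in set(query) & set(candidate))


def link(query, database, weights):
    """Return the index of the highest-scoring candidate record."""
    return max(range(len(database)), key=lambda i: score(query, database[i], weights))


# Hypothetical toy data: "HTN" appears in every record, the other codes are rare.
database = [
    ["HTN", "T2DM"],
    ["HTN", "CKD4"],
    ["HTN", "COPD"],
]
weights = idf_weights(database)
query = ["HTN", "CKD4"]
best = link(query, database, weights)
print(best)
```

    In this toy example the code shared by every record contributes nothing to the score, so the match is decided by the rare code alone. A learned model would also need to handle disease progression between the two records, which this sketch does not attempt.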

    Cited By

    • (2023) An iterative self-learning framework for medical domain generalization. Proceedings of the 37th International Conference on Neural Information Processing Systems, 54833-54854. DOI: 10.5555/3666122.3668515. Online publication date: 10-Dec-2023

    Published In

    KDD '23: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
    August 2023
    5996 pages
    ISBN: 9798400701030
    DOI: 10.1145/3580305
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. electronic health record
    2. entity resolution
    3. patient deduplication
    4. patient identification
    5. patient linkage
    6. record linkage

    Qualifiers

    • Research-article

    Funding Sources

    • SCH
    • IIS

    Conference

    KDD '23

    Acceptance Rates

    Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

    Article Metrics

    • Downloads (last 12 months): 522
    • Downloads (last 6 weeks): 26

    Reflects downloads up to 12 Aug 2024.
