Scalable Incident Detection Via Natural Language Processing and Probabilistic Language Models
1Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA. 2Department of Medicine, Vanderbilt University Medical Center, Nashville, TN, USA. 3Department of Psychiatry and Behavioral Sciences, Vanderbilt University Medical Center, Nashville, TN, USA. 4Center for Drug Evaluation and Research, United States Food and Drug Administration, Maryland, USA. 5Office of Surveillance and Epidemiology, United States Food and Drug Administration, Maryland, USA. 6Department of Biostatistics, Epidemiology and Informatics, and Pediatrics, University of Pennsylvania, Pennsylvania, USA. 7Department of Computer and Information Science, Bioengineering, University of Pennsylvania, Pennsylvania, USA. 8Department of Science Communication, University of Pennsylvania, Pennsylvania, USA. 9Washington Health Research Institute, Kaiser Permanente Washington, Washington, USA. 10Division of Pharmacoepidemiology and Pharmacoeconomics, Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, USA. 11Department of Population Medicine, Harvard Medical School, Harvard Pilgrim Health Care Institute, Boston, USA. 12Office of Translational Science, United States Food and Drug Administration, Maryland, USA. 13Vanderbilt University Medical Center, Nashville, USA. email: Colin.walsh@vumc.org

Background
Incident detection refers to identifying new occurrences of relevant events from existing data assets and systems. Clinical examples of incident events include myocardial infarction, overdose from substance use, or suicide attempt. Precise detection at enterprise or system scale from healthcare records remains a major challenge.

Once a product is approved by the Food and Drug Administration (FDA), post-marketing safety surveillance for that medication includes both active processes like Sentinel and passive processes like the FDA Adverse Event Reporting System (FAERS)1–4. Identifying adverse drug events (ADEs) or new-onset diseases that might relate to those new medications remains paramount5,6. Outside of FDA regulatory processes, population health requires ascertainment of clinical incidents to allocate resources and properly support frontline staff, e.g., at triage in
emergency settings7. Precision medicine and predictive modeling initiatives (e.g., Clinical Decision Support
[CDS]) also depend strongly on comprehensive ascertainment of phenotypes across biobanks and healthcare
data repositories to power studies appropriately and minimize type I error8,9.
Significant attention has been given to guiding reporting processes in post-market surveillance10, but gaps remain. Spontaneous reporting might be effective but requires awareness and understanding among
healthcare professionals11. It also has known limitations including under-reporting, duplication, and vulnerability
to media attention or other trends in reporting such as sampling variation and errors or bias in reporting12.
Novel systems that leverage computational automation to achieve scalability might improve incident detection
in healthcare broadly and post-market surveillance specifically.
Another challenge for incident detection is determining whether an event started in the past (i.e., is prevalent) or is a new occurrence (i.e., incident). Many prevalent chronic conditions might
have acute incident exacerbations, e.g., chronic obstructive pulmonary disease exacerbations, or be punctuated
by incident clinical events, e.g., suicide attempts. Even when structured diagnostic codes exist for such events,
differentiating whether those codes describe a new incident remains difficult. Current incident detection systems
depend on structured data though efforts to expand inputs to unstructured data are underway13. Coding data
are driven by billing processes and are prioritized by clinical relevance and reimbursement rates. Codes are often
linked to the conduct of procedures or diagnostic testing, are not deterministic for the coded condition, and still might not always be captured14–16. Restrictions or rules in patients’ insurance that impact coding practices may also
complicate incident assessment.
Temporality stands as another major obstacle to accurate incident detection17. Many events have sequelae that
result in similar or identical new data inputs18–20. Sequencing healthcare events or, more importantly, evaluating
potential causal links depends on establishing temporal order. Further, healthcare data might be recorded for a
given patient at a later date (e.g., clinical documentation after billing) or outside of a “healthcare encounter”, as
defined by most major vendor Electronic Health Records (EHRs).
A final challenge is that most open healthcare systems do not have broad interoperability or data sharing to enable
incident detection. Efforts to ameliorate this concern include national or payor systems, e.g., Veterans Health
Administration or Kaiser Permanente, state-supported interoperability, e.g., New York’s Healthix21,22, and
vendor-led tools for common users of EHRs, e.g., Epic Systems’ Care Everywhere. While a patient suffering, e.g.,
a myocardial infarction (MI), at one health system might not have billing codes recorded in EHRs at another,
that patient or their family would likely report the event to providers of care in another health system to ensure
optimal clinical decision-making and healthcare communication. However, while providers are expected to
obtain and summarize relevant patient care leading up to an encounter or interaction, structured codes from
prior care are not generally imported. A summary of outside care might be reliably documented in unstructured
clinical text in the routine practice of medicine.
Natural language processing (NLP) permits extraction and detection of incidents from unstructured textual
data, and has previously been used for event detection and identifying disease onset16,19. It has also been applied to accurately
identify social determinants of health to better understand the prevalence of these problems both within23 and
across24 health systems including in FDA-linked initiatives like Sentinel25,26. In work motivating this study, NLP
has been applied to suicidal ideation and suicide attempt yielding accurate and precise ascertainment from
unstructured text data agnostic to source data or EHR27.
To improve scalable incident detection, we developed and validated a novel incident phenotyping approach
based on a published, validated prevalence approach used to ascertain social determinants of health and
suicidality across entire healthcare records. To demonstrate generalizability, we validated this approach on two
separate phenotypes that share common challenges with respect to accurate ascertainment: (1) suicide attempt;
(2) sleep-related behaviors. Identification of these two phenotypes might also warrant further investigation and/
or have regulatory relevance given their neuropsychiatric nature.
Methods
All methods were performed in accordance with the relevant guidelines and regulations. The study was approved
by the institutional review board (IRB) at Vanderbilt University Medical Center (VUMC) with waiver of
informed consent (IRB #151156) given the infeasibility of consenting these EHR-driven analyses across a health
system and a large, retrospective dataset.
Cohort generation
Data were extracted from the Vanderbilt Research Derivative, an EHR repository including Protected Health
Information (PHI), for those receiving care at Vanderbilt University Medical Center (VUMC)28. PHI were
necessary to link to ongoing operational efforts to predict and prevent suicide pursuant to the suicidality
phenotypic work here29,30. Patient records were considered for years ranging from 1998 to 2022. For both suicide
attempt and sleep-related behaviors, we focused on adult patients aged over 18 years at the time of healthcare
encounters with any clinical narrative data in the EHR.
While the technical details of the Phenotypic Retrieval (PheRe) system adapted here have been published
elsewhere23,27, the algorithm’s retrieval method determined which records were included in this study. In brief,
after query formulation to establish key terms for each phenotype (see “Automatic extraction” below), this algorithm assigned scores to every adult patient record. Records with any non-zero NLP score, i.e., with at least one match to a term in the query lists, were included in subsequent analyses.
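The inclusion rule can be illustrated with a minimal sketch. The scoring function below is a toy stand-in for PheRe's published scorer, and the term lists and notes are invented for illustration, not the study's actual queries or data:

```python
def nlp_score(note_text, query_terms):
    """Count query-term matches in one note (a toy stand-in for PheRe's scorer)."""
    text = note_text.lower()
    return sum(text.count(term.lower()) for term in query_terms)

def included(patient_notes, query_terms):
    """Inclusion rule: any non-zero score, i.e., any single term match, qualifies."""
    return sum(nlp_score(n, query_terms) for n in patient_notes) > 0

terms = ["suicide attempt", "overdose"]  # illustrative query terms, not the study's
records = {
    "pt1": ["Patient presented after suicide attempt.", "Follow-up scheduled."],
    "pt2": ["Routine annual physical. No acute issues."],
}
cohort = [pid for pid, pt_notes in records.items() if included(pt_notes, terms)]
```

Here only the first hypothetical patient enters the cohort; a record with no query-term match scores zero and is excluded.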
Phenotype definitions
Our team has published extensively in the suicide informatics literature on developing, validating, and deploying
scalable predictive models of suicide risk into practice. As a result, suicide attempt was defined based on prior
work using published diagnostic code sets for the silver standard27 and domain knowledge-driven definitions for
the gold standard annotation (see below for details on both).
For sleep-related behaviors, our team reviewed the literature and sleep medicine textbooks for structured
diagnostic codes for sleep-related behaviors31,32. We also consulted with clinical experts in sleep-related
behaviors and sleep disorders in the Department of Otolaryngology at VUMC (see Acknowledgement). This
expertise informed both the silver and gold standards for this phenotype. The specific standards will be detailed
below; in brief, the silver standard was hypothesized to be less specific and less rigorous a performance test than
the gold standard yet easier to implement since it relied on structured data.
Early in the study, we considered various levels of granularity, e.g., “parasomnias” (more general) versus “sleepwalking” (less general). We prioritized sleep-related behaviors that might be associated with black-box
warnings if found to be associated with a hypothetical medication or other intervention. We selected a subset
of sleep-related behaviors - sleepwalking, sleep-eating, sleep-driving - as the resulting foci for this investigation.
Temporality
In prior work23, we applied NLP to ascertain evidence of any suicide attempt or record of suicidal ideation from
EHRs across entire healthcare records27. In this work, the intent was to ascertain evidence of new, clinically
distinct incidents of these phenotypes. To move from prevalence-focused to incidence-focused algorithms, we
constrained the time windows for input data processed by NLP. For example, rather than ascertaining suicidal
ideation from every note in an individual’s health record, we considered looking at notes documented on a
single calendar day. The team discussed numerous potential temporal windows to focus the NLP including: (i)
healthcare visit episodes (time of admission to time of discharge); (ii) set time-windows, e.g., twenty-four-hour
periods; seven days; thirty days; multiple months; (iii) combinations of those two, e.g., clinical episodes plus/
minus a time-window to capture documentation lag.
After discussion and preliminary analyses, we selected a 24-hour period, midnight to the following midnight,
as the window for this incident detection NLP approach. This window was chosen for clinical utility, simplicity,
and agnosticism to vendor EHR or documentation schema. Operationally, this meant that we considered all the
notes of a patient on a given day to encode a potential incident phenotype.
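The 24-hour windowing above can be sketched as follows. This is a minimal illustration assuming a simple note schema (patient ID, ISO timestamp, text), not the Research Derivative's actual data model:

```python
from collections import defaultdict
from datetime import date, datetime

# Illustrative notes; field names are assumptions for this sketch only.
notes = [
    {"patient_id": "pt1", "timestamp": "2021-03-01T09:15:00", "text": "ED note"},
    {"patient_id": "pt1", "timestamp": "2021-03-01T22:40:00", "text": "Psych consult"},
    {"patient_id": "pt1", "timestamp": "2021-03-02T08:00:00", "text": "Discharge summary"},
]

day_docs = defaultdict(list)
for note in notes:
    # Truncating to the calendar date implements the midnight-to-midnight window.
    day = datetime.fromisoformat(note["timestamp"]).date()
    day_docs[(note["patient_id"], day)].append(note["text"])

# Each (patient, calendar day) pair becomes one candidate incident document.
pooled = {key: " ".join(texts) for key, texts in day_docs.items()}
```

The two notes from March 1 pool into one candidate incident document, while the March 2 note forms a second, separate candidate.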
Fig. 1. Overview of the automatic extraction process enabling incident detection, with steps numbered and legend shown.
Table 1. Baseline study characteristics. *Other includes all combinations of coded race categories and a distinct category labeled “Other” in source EHR documentation.
Automatic extraction
We used Google’s word2vec33 and transformer-based NLP models such as Bidirectional Encoder Representations from Transformers (BERT)34 to learn context-independent and context-sensitive word embeddings, respectively. The extraction of phenotypic profiles consisted of iteratively expanding an initial set of highly relevant expressions (also called ‘seeds’), such as ‘suicide’, as follows. First, we ranked the learned embeddings by their similarity to the seed embeddings. Then, we manually reviewed the top-ranked expressions and selected the relevant ones as new seed expressions. The final sets of text expressions corresponding to each phenotype of interest are listed in the eSupplement.
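The ranking step of this seed expansion can be sketched as below. The embeddings are toy vectors standing in for learned word2vec/BERT embeddings, and `rank_candidates` is an illustrative helper, not the study's code; in the real workflow the top-ranked terms are reviewed manually before being promoted to seeds:

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-d embeddings standing in for learned word embeddings.
embeddings = {
    "suicide": [0.9, 0.1, 0.0],
    "suicidal": [0.85, 0.15, 0.05],
    "overdose": [0.7, 0.3, 0.1],
    "appointment": [0.0, 0.2, 0.9],
}

def rank_candidates(seeds, embeddings, top_k=2):
    """Rank non-seed terms by their best similarity to any seed embedding.
    Top-ranked terms are candidates for manual review and seed promotion."""
    candidates = [t for t in embeddings if t not in seeds]
    scored = sorted(
        ((max(cosine(embeddings[t], embeddings[s]) for s in seeds), t)
         for t in candidates),
        reverse=True,
    )
    return [t for _, t in scored[:top_k]]
```

With the seed {"suicide"}, the nearest candidates are "suicidal" and "overdose", while an unrelated term like "appointment" ranks last.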
With rare outcomes, algorithms might perform with adequate c-statistics at the expense of low precision (high false-positive rates). For a hypothetical post-market safety surveillance system, such false positives would be problematic for both burden and accuracy. Similarly, high recall (true-positive rate) ensures that cases of potential adverse events in a hypothetical system would not be missed.
Evaluation metrics
Metrics to evaluate NLP performance mirrored those used in the preliminary analyses above, including precision-recall (P-R) metrics and curves and the F1-score. We also calculated error by score bin to understand how well the NLP score performed across all thresholds. The intent was to replicate a common clinical implementation challenge: discretizing a continuous output from an algorithm into a binary event, e.g., a decision or an intervention that is inherently binary in practice.
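The error-by-score-bin calculation might look like the following sketch; scores and gold labels are synthetic, and five equal-width bins are an assumption for illustration:

```python
from collections import defaultdict

# Synthetic (nlp_score, gold_label) pairs; in practice these would be the
# continuous NLP score and the chart-review (gold) or ICD-based (silver) label.
records = [(0.2, 0), (0.3, 1), (0.4, 0), (0.6, 1), (0.7, 1), (0.9, 1)]

bins = defaultdict(lambda: [0, 0])  # bin index -> [errors, total]
for score, label in records:
    b = min(int(score * 5), 4)  # five equal-width bins over [0, 1]
    bins[b][0] += (label == 0)  # error: record scored in this bin but no true event
    bins[b][1] += 1

error_by_bin = {b: errs / total for b, (errs, total) in sorted(bins.items())}
```

Reading the result, high-score bins should show low error if the score is well calibrated; a high-error high-score bin flags a threshold region to avoid.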
Threshold selection
Because the NLP produces a continuous score amenable to precision-recall metrics, users might select optimal
performance thresholds through traditional means, as well. For example, thresholds might be chosen that
maximize F-scores such as F0.5-, F1-, or F2-scores which emphasize precision, balanced precision/recall, or
recall, respectively. We used the F1-score here to select optimal thresholds for these NLP algorithms.
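Threshold selection by F-score maximization can be sketched as follows on synthetic data; F0.5- or F2-optimal thresholds follow by changing `beta`:

```python
def f_beta(precision, recall, beta):
    """F-beta score; beta < 1 weights precision, beta > 1 weights recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def best_threshold(scores, labels, beta=1.0):
    """Scan each observed score as a candidate threshold; return the
    (F-beta, threshold) pair with the maximum F-beta."""
    best = (0.0, None)
    for t in sorted(set(scores)):
        pred = [s >= t for s in scores]
        tp = sum(p and y for p, y in zip(pred, labels))
        fp = sum(p and not y for p, y in zip(pred, labels))
        fn = sum((not p) and y for p, y in zip(pred, labels))
        prec = tp / (tp + fp) if (tp + fp) else 0.0
        rec = tp / (tp + fn) if (tp + fn) else 0.0
        f = f_beta(prec, rec, beta)
        if f > best[0]:
            best = (f, t)
    return best

# Synthetic continuous NLP scores and gold labels for illustration.
scores = [0.1, 0.4, 0.35, 0.8, 0.65, 0.9]
labels = [0, 0, 1, 1, 0, 1]
f_best, t_best = best_threshold(scores, labels, beta=1.0)
```

Scanning only the observed scores suffices because F-beta is constant between consecutive score values.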
Results
Baseline patient characteristics by phenotype
For suicide attempt and sleep-related behaviors, the study cohorts included 89,428 and 35,863 patients, respectively. As outlined in Cohort generation above, these numbers included any records with at least one query-term match in a single day of notes for that patient. Baseline study characteristics at each patient’s first documented
visit are shown (Table 1).
Discussion
In this study, an NLP-based incident detection system was developed and validated across two challenging and
disparate phenotypes. Such detection was feasible to conduct using a 24-hour period of documentation, agnostic
to EHR architecture or data standard, or to underlying textual source systems. The implication for initiatives like FDA Sentinel is that scalable detection would be achievable with appropriate evaluation and milestones along the implementation path. For example, in these two phenotypes, gold standard manual chart review
was necessary and would remain necessary for novel phenotypes on a subset of charts in development sites. But
here as in many phenotypic examples, prior work facilitated the sample size calculation and chart review itself.
Moreover, silver standard validation permitted efficient sample size calculation and error estimation across all
possible NLP scores, not solely those highest ranked “top-K” records as in traditional information retrieval. Even
with imperfect coded race variables, performance differences were easily identifiable here. Algorithmovigilance37
to prevent perpetuating or worsening disparities remains paramount in the evolution of systems like this one –
not solely at initial algorithm validation but throughout the life cycle38.
Performance varied by phenotype with better performance in suicide attempt, a phenotype that was more
common and with a clearly defined clinically observable set of attributes. Sleep-related behaviors, even when
focused on sleepwalking, -eating, -driving, are still a group of diagnoses that may or may not be documented
clearly in every visit even when present. That is, these selected phenotypes differ in clinical specificity and in the
degree to which care will focus on them if observed. However, both might be associated with regulatory action
if a new medication or device were shown to cause them. Despite the rarity of all phenotypes in this study, the
unsupervised learning approach used to derive NLP scores relies on the assertions present in each record, not
on the overall prevalence or case balance as in a supervised learning paradigm. Thus rarity should not introduce
bias or affect performance for the unsupervised learning NLP developed here, nor would it for subsequent rare
phenotypes in future work.
Active efforts including a recent National Academies of Sciences, Engineering, and Medicine (NASEM) report
suggest excluding race from genetic and genomic studies39. Race remains a social construct with potential to
reflect or worsen healthcare disparities if not handled appropriately. Use of clinical language does vary by race
and this usage remains an important limitation of studies like this one. Here, coded race was used as a means of
testing these NLP algorithms for disparate performance as an early checkpoint in model validation. Our findings
are hypothesis-generating with respect to reasons for performance differences and more detailed analyses with
better quality race data are indicated in any case. A parallel effort in replicating an approach like this one in new
phenotypes would necessitate careful consideration of factors like demographics, clinical attributes or others
that might undermine the successes of an incident detection system at scale.
Silver standard evaluation alone was not sufficient to estimate final NLP performance observed here. Both
sleep-related behaviors and suicide attempt were associated with ~60% positive predictive value (PPV) in silver standard, ICD-based performance, but suicide attempt as a phenotype was much better identified by this method than sleep-related
behaviors. Thus, some manner of gold standard evaluation is indicated for adding new phenotypes to this system.
To that end, annotation guides and annotator training facilitated rigorous multi-reviewer chart validation as did
sample size calculation of numbers of required charts to review based on the silver standard.
This work builds on the work of others by adding to understanding of unstructured data-based phenotyping
algorithms in neuropsychiatric phenotypes with emphasis on temporality and incident detection. Determining
temporal onset of symptoms with NLP has been attempted in clinical areas including psychosis40, perinatal
health41, and hematology42. Deep learning has been used with NLP-based features to identify acute on chronic
exacerbations of chronic diseases such as hypoglycemic events in diabetes43. More recently, large language
models (LLMs) have been studied for clinical text extraction44–47. The rise in prominence of LLMs occurred
after the work reported here and our team and others are now investigating their performance in NLP-based
ascertainment. While phenotyping in suicidality has included NLP in numerous studies including those of this
team, phenotyping in sleep disorders has been less commonly reported48,49. This study adds to evidence that
sleep-related behaviors might be less well-coded and well-documented than other neuropsychiatric phenotypes
and therefore NLP-based algorithms to detect them were more challenging to develop.
Overall, this NLP-based incident detection approach scaled to diverse, rare phenotypes and benefited from
multiple levels of silver and gold standard evaluation. A post-market safety surveillance system using this
phenotypic detection method would need important advances in addition to those reported here to be effective.
For example, sequence handling and temporal representation to capture sequences of events would be critical
to move toward causal inference. Design of interpretable, transparent interfaces that summarized candidate
incidents and their contributors would also be needed. Coupling such advances with expansion of incidents to
new phenotypes would be important areas of future work.
Data availability
The datasets generated and/or analyzed during the current study are not publicly available due to their sensitive
nature and their inclusion of PHI/PII but are available from the corresponding author on reasonable request.
References
1. Ball, R., Robb, M., Anderson, S. & Dal Pan, G. The FDA’s sentinel initiative—A comprehensive approach to medical product
surveillance. Clin. Pharmacol. Ther. 99, 265–268 (2016).
2. Behrman, R. E. et al. Developing the Sentinel System — A National Resource for evidence development. N Engl. J. Med. 364,
498–499 (2011).
3. Robb, M. A. et al. The US Food and Drug Administration’s Sentinel Initiative: expanding the horizons of medical product safety.
Pharmacoepidemiol Drug Saf. 21, 9–11 (2012).
4. Platt, R. et al. The FDA Sentinel Initiative — an Evolving National Resource. N Engl. J. Med. 379, 2091–2093 (2018).
5. Feng, C., Le, D. & McCoy, A. B. Using Electronic Health Records to identify adverse drug events in Ambulatory Care: a systematic
review. Appl. Clin. Inf. 10, 123–128 (2019).
6. Liu, F., Jagannatha, A. & Yu, H. Towards Drug Safety Surveillance and Pharmacovigilance: current progress in detecting medication
and adverse drug events from Electronic Health Records. Drug Saf. 42, 95–97 (2019).
7. Fernandes, M. et al. Clinical decision support systems for Triage in the Emergency Department using Intelligent systems: a review.
Artif. Intell. Med. 102, 101762 (2020).
8. Panahiazar, M., Taslimitehrani, V., Pereira, N. L. & Pathak, J. Using EHRs for heart failure therapy recommendation using
Multidimensional Patient Similarity Analytics. Stud. Health Technol. Inf. 210, 369–373 (2015).
9. Zhang, P., Wang, F., Hu, J. & Sorrentino, R. Towards personalized medicine: leveraging patient similarity and drug similarity
analytics. AMIA Jt. Summits Transl. Sci. Proc. 2014, 132–136 (2014).
10. Center for Devices and Radiological Health. Postmarket Surveillance Under Section 522 of the Federal Food, Drug, and Cosmetic Act. U.S. Food and Drug Administration (2022). https://www.fda.gov/regulatory-information/search-fda-guidance-documents/postmarket-surveillance-under-section-522-federal-food-drug-and-cosmetic-act
11. Alomar, M., Tawfiq, A. M., Hassan, N. & Palaian, S. Post marketing surveillance of suspected adverse drug reactions through
spontaneous reporting: current status, challenges and the future. Ther. Adv. Drug Saf. 11, 2042098620938595 (2020).
12. Bate, A. & Evans, S. J. W. Quantitative signal detection using spontaneous ADR reporting. Pharmacoepidemiol Drug Saf. 18, 427–436 (2009).
13. Methods | Sentinel Initiative. https://www.sentinelinitiative.org/methods-data-tools/methods
14. Banerji, A. et al. Natural Language Processing combined with ICD-9-CM codes as a Novel Method to study the epidemiology of
allergic drug reactions. J. Allergy Clin. Immunol. Pract. 8, 1032–1038e1 (2020).
15. Bayramli, I. et al. Predictive structured-unstructured interactions in EHR models: a case study of suicide prediction. NPJ Digit.
Med. 5, 15 (2022).
16. Borjali, A. et al. Natural language processing with deep learning for medical adverse event detection from free-text medical
narratives: a case study of detecting total hip replacement dislocation. Comput. Biol. Med. 129, 104140 (2021).
17. Xie, F. et al. Deep learning for temporal data representation in electronic health records: a systematic review of challenges and
methodologies. J. Biomed. Inf. 126, 103980 (2022).
18. Sun, W., Rumshisky, A. & Uzuner, O. Evaluating temporal relations in clinical text: 2012 i2b2 challenge. J. Am. Med. Inf. Assoc. 20,
806–813 (2013).
19. Viani, N. et al. A natural language processing approach for identifying temporal disease onset information from mental healthcare
text. Sci. Rep. 11, 757 (2021).
20. Sheikhalishahi, S. et al. Natural Language Processing of Clinical Notes on Chronic diseases: systematic review. JMIR Med. Inf. 7,
e12239 (2019).
21. Zech, J., Husk, G., Moore, T., Kuperman, G. J. & Shapiro, J. S. Identifying homelessness using health information exchange data. J.
Am. Med. Inf. Assoc. JAMIA. 22, 682–687 (2015).
22. Moore, T. et al. Event detection: a clinical notification service on a health information exchange platform. AMIA Annu. Symp. Proc.
AMIA Symp. 2012, 635–642 (2012).
23. Bejan, C. A. et al. Mining 100 million notes to find homelessness and adverse childhood experiences: 2 case studies of rare and
severe social determinants of health in electronic health records. J. Am. Med. Inf. Assoc. JAMIA. 25, 61–71 (2018).
24. Dorr, D. et al. Identifying patients with significant problems related to Social Determinants of Health with Natural Language
Processing. Stud. Health Technol. Inf. 264, 1456–1457 (2019).
25. Desai, R. J. et al. Broadening the reach of the FDA Sentinel system: a roadmap for integrating electronic health record data in a
causal analysis framework. NPJ Digit. Med. 4, 170 (2021).
26. Carrell, D. S. et al. Improving methods of identifying Anaphylaxis for Medical Product Safety Surveillance using Natural Language
Processing and Machine Learning. Am. J. Epidemiol. 192, 283–295 (2023).
27. Bejan, C. A. et al. Improving ascertainment of suicidal ideation and suicide attempt with natural language processing. Sci. Rep. 12,
15146 (2022).
28. Danciu, I. et al. Secondary use of clinical data: the Vanderbilt approach. J. Biomed. Inf. 52, 28–35 (2014).
29. Walsh, C. G. et al. Prospective validation of an Electronic Health Record–Based, real-time suicide risk model. JAMA Netw. Open.
4, e211428 (2021).
30. Wilimitis, D. et al. Integration of Face-to-face Screening with Real-time machine learning to Predict risk of suicide among adults.
JAMA Netw. Open. 5, e2212095 (2022).
31. The Oxford Handbook of Sleep and Sleep Disorders (Oxford University Press, 2012). https://doi.org/10.1093/oxfordhb/9780195376203.001.0001
32. Barkoukis, T. J., Matheson, J. K., Ferber, R. & Doghramji, K. Therapy in Sleep Medicine E-Book (Elsevier Health Sciences, 2011).
33. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S. & Dean, J. Distributed representations of words and phrases and their compositionality. in Advances in Neural Information Processing Systems vol. 26 (Curran Associates, Inc., 2013).
34. Devlin, J., Chang, M. W., Lee, K. & Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. Preprint at https://doi.org/10.48550/arXiv.1810.04805 (2019).
35. WHO | International Classification of Diseases. WHO (2017). http://www.who.int/classifications/icd/en/
36. Swain, R. S. et al. A systematic review of validated suicide outcome classification in observational studies. Int. J. Epidemiol. 48,
1636–1649 (2019).
37. Embi, P. J. Algorithmovigilance—advancing methods to analyze and monitor Artificial Intelligence–Driven Health Care for
Effectiveness and Equity. JAMA Netw. Open. 4, e214622 (2021).
38. J. Am. Med. Inform. Assoc. 26, 1645–1650 (2019).
39. Using Population Descriptors in Genetics and Genomics Research: A New Framework for an Evolving Field (National Academies Press, Washington, D.C., 2023). https://doi.org/10.17226/26902
40. Viani, N. et al. Annotating temporal relations to determine the onset of psychosis symptoms. Stud. Health Technol. Inf. 264, 418–
422 (2019).
41. Ayre, K. et al. Developing a Natural Language Processing tool to identify perinatal self-harm in electronic healthcare records. PloS
One. 16, e0253809 (2021).
42. Fu, J. T., Sholle, E., Krichevsky, S., Scandura, J. & Campion, T. R. Extracting and classifying diagnosis dates from clinical notes: a
case study. J. Biomed. Inf. 110, 103569 (2020).
43. Jin, Y., Li, F., Vimalananda, V. G. & Yu, H. Automatic Detection of Hypoglycemic Events from the Electronic Health Record notes
of Diabetes patients: empirical study. JMIR Med. Inf. 7, e14340 (2019).
44. Cheligeer, C. et al. Validating Large Language Models for Identifying Pathologic Complete Responses After Neoadjuvant Chemotherapy for Breast Cancer Using a Population-Based Pathologic Report Data. Preprint at https://doi.org/10.21203/rs.3.rs-4004164/v1 (2024).
45. Yang, J. et al. Enhancing phenotype recognition in clinical notes using large language models: PhenoBCBERT and PhenoGPT.
Patterns 5, (2024).
46. Elmarakeby, H. A. et al. Empirical evaluation of language modeling to ascertain cancer outcomes from clinical text reports. BMC
Bioinform. 24, 328 (2023).
47. Hays, S. & White, D. J. Employing LLMs for Incident Response Planning and Review. Preprint at https://doi.org/10.48550/arXiv.2403.01271 (2024).
48. Cade, B. E. et al. Sleep apnea phenotyping and relationship to disease in a large clinical biobank. JAMIA Open. 5, ooab117 (2022).
49. Chen, W., Kowatch, R., Lin, S., Splaingard, M. & Huang, Y. Interactive cohort identification of Sleep Disorder patients using
Natural Language Processing and i2b2. Appl. Clin. Inf. 6, 345–363 (2015).
Acknowledgements
Our team thanks Dr. David Kent in the Department of Otolaryngology at VUMC for expertise in sleep-related
behaviors and insight into phenotypic definitions and acceptable silver standard diagnostic codes used here. We
thank Dr. Patricia Bright for reviewing our manuscript prior to submission.
Author contributions
CW wrote the manuscript. CW, DW, CB, QC, MR conducted modeling and analyses. DW, CB prepared figures.
All authors reviewed the manuscript.
Funding
All investigators were supported on FDA WO2006. Dr. Walsh is also supported in part by NIMH R01MH121455,
R01MH116269, and Wellcome Leap MCPsych.
Funders played no role in design and conduct of the study; collection, management, analysis, and interpre-
tation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript
for publication.
Declarations
Competing interests
The authors declare no competing interests.
Additional information
Supplementary Information The online version contains supplementary material available at https://doi.
org/10.1038/s41598-024-72756-7.
Correspondence and requests for materials should be addressed to C.G.W.
Reprints and permissions information is available at www.nature.com/reprints.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives
4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in
any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide
a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have
permission under this licence to share adapted material derived from this article or parts of it. The images or
other third party material in this article are included in the article’s Creative Commons licence, unless indicated
otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and
your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain
permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/
licenses/by-nc-nd/4.0/.