Understanding the contribution of genetic variation to drug response can improve the delivery of precision medicine. However, genome-wide association studies (GWAS) for drug response are uncommon and are often hindered by small sample sizes. We present a high-throughput framework to efficiently identify eligible patients for genetic studies of adverse drug reactions (ADRs) using “drug allergy” labels from electronic health records (EHRs). As a proof of concept, we conducted GWAS for ADRs to 14 common drugs or drug groups with 81,739 individuals from Vanderbilt University Medical Center’s BioVU DNA Biobank. We identified 7 genetic loci associated with ADRs at P < 5 × 10⁻⁸, including known genetic associations such as CYP2D6 and OPRM1 for CYP2D6-metabolized opioid ADR. Additional expression quantitative trait loci and phenome-wide association analyses added evidence to the observed associations. Our high-throughput framework is both scalable and portable, enabling impactful pharmacogen...
Introduction: Currently, one of the commonly used methods for disseminating electronic health record (EHR)-based phenotype algorithms is providing a narrative description of the algorithm logic, often accompanied by flowcharts. A challenge with this mode of dissemination is the potential for under-specification in the algorithm definition, which leads to ambiguity and vagueness. Methods: This study examines incidents of under-specification that occurred during the implementation of 34 narrative phenotyping algorithms in the electronic Medical Record and Genomics (eMERGE) network. We reviewed the online communication history between algorithm developers and implementers within the Phenotype Knowledge Base (PheKB) platform, where questions could be raised and answered regarding the intended implementation of a phenotype algorithm. Results: We developed a taxonomy of under-specification categories via an iterative review process between two groups of annotators. Under-specifications that ...
Discovering novel uses for existing drugs, through drug repurposing, can reduce the time, costs, and risk of failure associated with new drug development. However, prioritizing drug repurposing candidates for downstream studies remains challenging. Here, we present a high-throughput approach to identify and validate drug repurposing candidates. This approach integrates human gene expression, drug perturbation, and clinical data from publicly available resources. We apply this approach to find drug repurposing candidates for two diseases, hyperlipidemia and hypertension. We screen >21,000 compounds and replicate ten approved drugs. We also identify 25 (seven for hyperlipidemia, eighteen for hypertension) drugs approved for other indications with therapeutic effects on clinically relevant biomarkers. For five of these drugs, the therapeutic effects are replicated in the All of Us Research Program database. We anticipate our approach will enable researchers to integrate multiple pub...
The MEDication-Indication (MEDI) knowledgebase has been utilized in research with electronic health records (EHRs) since its publication in 2013. To account for new drugs and terminology updates, we rebuilt MEDI to overhaul the knowledgebase for modern EHRs. Indications for prescribable medications were extracted using natural language processing and ontology relationships from six publicly available resources: RxNorm, Side Effect Resource 4.1, Mayo Clinic, WebMD, MedlinePlus, and Wikipedia. We compared the estimated precision and recall between the previous MEDI (MEDI-1) and the updated version (MEDI-2) with manual review. MEDI-2 contains 3031 medications and 186,064 indications. The MEDI-2 high precision subset (HPS) includes indications found within RxNorm or at least three other resources. MEDI-2 and MEDI-2 HPS contain 13% more medications and over triple the indications compared to MEDI-1 and MEDI-1 HPS, respectively. Manual review showed MEDI-2 achieves the same precision (0.6...
Background & Aims: Primary non-response (PNR) to anti-tumor necrosis factor-α (TNFα) biologics is a serious concern in patients with inflammatory bowel disease (IBD). We aimed to identify the genetic variants associated with PNR. Methods: Patients were recruited from outpatient GI clinics and PNR was determined using both clinical and endoscopic findings. A case-control genome-wide association study was performed in 589 IBD patients and associations were replicated in an independent cohort of 293 patients. The effect of the associated variant on gene expression and TNFα secretion was assessed by cell-based assays. Pleiotropic effects were investigated by phenome-wide association study (PheWAS). Results: We identified rs34767465 as associated with PNR to anti-TNFα therapy (OR: 2.07, 95% CI: 1.46-2.94, p=2.43×10⁻⁷; replication OR: 1.8, 95% CI: 1.04-3.16, p=0.03). rs34767465 is a multiple-tissue expression quantitative trait locus for FAM114A2. Using RNA-sequencing and protein quantification from HapM...
Introduction: The role of serum urate level has been extensively investigated in observational studies. However, the extent of any causal effect remains unclear, making it difficult to evaluate its clinical relevance. Objectives: To explore any causal or pleiotropic association between serum urate level and a broad spectrum of disease outcomes. Methods: Phenome-wide association study (PheWAS) together with a Bayesian analysis of tree-structured phenotypic models (TreeWAS) was performed to examine disease outcomes related to genetically determined serum urate levels in 339,256 UK Biobank participants. Mendelian Randomisation (MR) analyses were performed to replicate significant findings using various GWAS consortia data. Sensitivity analyses were conducted to examine possible pleiotropic effects on metabolic traits of the genetic variants used as instruments for serum urate. Results: PheWAS analysis, examining the association with 1,431 disease outcomes, identified a multitude of diseas...
Analyzing 5770 all-cause cirrhosis cases and 572,850 controls from seven cohorts, we identify a missense variant in the Mitochondrial Amidoxime Reducing Component 1 gene (MARC1 p.A165T) that associates with protection from all-cause cirrhosis (OR 0.88, p=2.1×10⁻⁸). This same variant also associates with lower levels of hepatic fat on computed tomographic imaging and lower odds of physician-diagnosed fatty liver, as well as lower blood levels of alanine transaminase (−0.012 SD, p=1.4×10⁻⁸), alkaline phosphatase (−0.019 SD, p=6.6×10⁻⁹), total cholesterol (−0.037 SD, p=1×10⁻¹⁸) and LDL cholesterol (−0.035 SD, p=7.3×10⁻¹⁶). Carriers of rare protein-truncating variants in MARC1 had lower liver enzyme levels, lower cholesterol levels, and reduced odds of liver disease (OR 0.19, p=0.04), suggesting that deficiency of the MARC1 enzyme protects against cirrhosis.
Genome-wide and phenome-wide association studies are commonly used to identify important relationships between genetic variants and phenotypes. Most of these studies have treated diseases as independent variables and suffered from heavy multiple adjustment burdens due to the large number of genetic variants and disease phenotypes. In this study, we propose using topic modeling via non-negative matrix factorization (NMF) for identifying associations between disease phenotypes and genetic variants. Topic modeling is an unsupervised machine learning approach that can be used to learn the semantic patterns from electronic health record data. We chose rs10455872 in LPA as the predictor since it has been shown to be associated with increased risk of hyperlipidemia and cardiovascular diseases (CVD). Using data from 12,759 individuals from the biobank at Vanderbilt University Medical Center, we trained a topic model using NMF from 1,853 distinct phecodes extracted from the cohort’s electronic...
Background: The PheCode system was built upon the International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM) for phenome-wide association studies (PheWAS) in the electronic health record (EHR). Objective: Here, we present our work on the development and evaluation of maps from ICD-10 and ICD-10-CM codes to PheCodes. Methods: We mapped ICD-10 and ICD-10-CM codes to PheCodes using a number of methods and resources, such as concept relationships and explicit mappings from the Unified Medical Language System (UMLS), Observational Health Data Sciences and Informatics (OHDSI), Systematized Nomenclature of Medicine - Clinical Terms (SNOMED CT), and National Library of Medicine (NLM). We assessed the coverage of the maps in two databases: Vanderbilt University Medical Center (VUMC) using ICD-10-CM and the UK Biobank (UKBB) using ICD-10. We assessed the fidelity of the ICD-10-CM map in comparison to the gold-standard ICD-9-CM→PheCode map by investigating phenotype rep...
Background: Drug effects can be investigated through natural variation in the genes for their protein targets. We aimed to use this approach to explore the potential side effects and repurposing potential of antihypertensive drugs, which are amongst the most commonly used medications worldwide. Methods: We identified genetic instruments for antihypertensive drug classes as variants in the gene for the corresponding target that associated with systolic blood pressure at genome-wide significance. To validate the instruments, we compared Mendelian randomisation (MR) estimates for drug effects on coronary heart disease (CHD) and stroke risk to randomised controlled trial (RCT) results. Phenome-wide association study (PheWAS) in the UK Biobank was performed to identify potential side effects and repurposing opportunities, with findings investigated in the Vanderbilt University Biobank (BioVU) and in observational analysis of the UK Biobank. Findings: We identified suitable genetic instruments fo...
In genome-wide association studies (GWAS) for thousands of phenotypes in large biobanks, most binary traits have substantially fewer cases than controls. Both of the widely used approaches, the linear mixed model and the recently proposed logistic mixed model, perform poorly; they produce large type I error rates when used to analyze unbalanced case-control phenotypes. Here we propose a scalable and accurate generalized mixed model association test that uses the saddlepoint approximation to calibrate the distribution of score test statistics. This method, SAIGE (Scalable and Accurate Implementation of GEneralized mixed model), provides accurate P values even when case-control ratios are extremely unbalanced. SAIGE uses state-of-the-art optimization strategies to reduce computational costs; hence, it is applicable to GWAS for thousands of phenotypes by large biobanks. Through the analysis of UK Biobank data of 408,961 samples from white British participants with European ancestry for >...
We aimed to investigate the role of serum uric acid (SUA) level in a broad spectrum of disease outcomes using data for 120,091 individuals from UK Biobank. We performed a phenome-wide association study (PheWAS) to identify disease outcomes associated with SUA genetic risk loci. We then implemented conventional Mendelian randomisation (MR) analysis to investigate the causal relevance between SUA level and disease outcomes identified from PheWAS. We next applied MR Egger analysis to detect and account for potential pleiotropy, which conventional MR analysis might mistake for causality, and used the HEIDI (heterogeneity in dependent instruments) test to remove cross-phenotype associations that were likely due to genetic linkage. Our PheWAS identified 25 disease groups/outcomes associated with SUA genetic risk loci after multiple testing correction (P < 8.57 × 10⁻⁵). Our conventional MR analysis implicated a causal role of SUA level in three disease groups: inflammatory polyarthropathies ...
Genetic association studies often examine features independently, potentially missing subpopulations with multiple phenotypes that share a single cause. We describe an approach that aggregates phenotypes on the basis of patterns described by Mendelian diseases. We mapped the clinical features of 1204 Mendelian diseases into phenotypes captured from the electronic health record (EHR) and summarized this evidence as phenotype risk scores (PheRSs). In an initial validation, PheRS distinguished cases and controls of five Mendelian diseases. Applying PheRS to 21,701 genotyped individuals uncovered 18 associations between rare variants and phenotypes consistent with Mendelian diseases. In 16 patients, the rare genetic variants were associated with severe outcomes such as organ transplants. PheRS can augment rare-variant interpretation and may identify subsets of patients with distinct genetic causes for common diseases.
To compare three groupings of Electronic Health Record (EHR) billing codes for their ability to represent clinically meaningful phenotypes and to replicate known genetic associations. The three tested coding systems were the International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM) codes, the Agency for Healthcare Research and Quality Clinical Classification Software for ICD-9-CM (CCS), and manually curated "phecodes" designed to facilitate phenome-wide association studies (PheWAS) in EHRs. We selected 100 disease phenotypes and compared the ability of each coding system to accurately represent them without performing additional groupings. The 100 phenotypes included 25 randomly chosen clinical phenotypes pursued in prior genome-wide association studies (GWAS) and another 75 common disease phenotypes mentioned across free-text problem lists from 189,289 individuals. We then evaluated the performance of each coding system to replicate known ...
Journal of the American Medical Informatics Association, 2016
Objective: Phenotyping algorithms applied to electronic health record (EHR) data enable investigators to identify large cohorts for clinical and genomic research. Algorithm development is often iterative, depends on fallible investigator intuition, and is time- and labor-intensive. We developed and evaluated 4 types of phenotyping algorithms and categories of EHR information to identify hypertensive individuals and controls and provide a portable module for implementation at other sites. Materials and Methods: We reviewed the EHRs of 631 individuals followed at Vanderbilt for hypertension status. We developed features and phenotyping algorithms of increasing complexity. Input categories included International Classification of Diseases, Ninth Revision (ICD9) codes, medications, vital signs, narrative-text search results, and Unified Medical Language System (UMLS) concepts extracted using natural language processing (NLP). We developed a module and tested portability by replicating 10...
Papers by Wei-qi Wei