Abstract
Identifying functional effects of noncoding variants is a major challenge in human genetics. To predict the noncoding-variant effects de novo from sequence, we developed a deep learning–based algorithmic framework, DeepSEA (http://deepsea.princeton.edu/), that directly learns a regulatory sequence code from large-scale chromatin-profiling data, enabling prediction of chromatin effects of sequence alterations with single-nucleotide sensitivity. We further used this capability to improve prioritization of functional variants including expression quantitative trait loci (eQTLs) and disease-associated variants.
Acknowledgements
This work was primarily supported by US National Institutes of Health (NIH) grants R01 GM071966 and R01 HG005998 to O.G.T. This work was supported in part by the US National Science Foundation (NSF) CAREER award (DBI-0546275), NIH award T32 HG003284 and NIH grant P50 GM071508. O.G.T. is supported by the Genetic Networks program of the Canadian Institute for Advanced Research (CIFAR). We acknowledge the TIGRESS high-performance computer center at Princeton University for computational resource support. We are grateful to all Troyanskaya laboratory members for valuable discussions.
Author information
Contributions
J.Z. designed the study, with input from O.G.T. J.Z. developed the method and analyzed the results. O.G.T. supervised the study. J.Z. and O.G.T. wrote the paper.
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Integrated supplementary information
Supplementary Figure 1 Performance comparison of DeepSEA models trained with different context sequence lengths
DeepSEA models with the same architecture as described in the Online Methods were trained on 200 bp, 500 bp, and 1000 bp input sequences, respectively, and the AUCs of all chromatin features are shown as box plots. Although the chromatin feature labels were always determined from the central 200 bp regions, increasing the context sequence length significantly improved model performance (P-value < 2.2e-16 by Wilcoxon signed rank test between any pair of models).
Supplementary Figure 2 Performance comparison with gkm-SVM
The deep convolutional network model outperformed the gapped k-mer SVM (gkm-SVM) on transcription factor binding prediction. The deep convolutional network achieved a higher area under the receiver operating characteristic curve (AUC) for almost all transcription factors (left panel). The gapped k-mer SVM did not gain performance from increasing the size of the context sequences (right panel).
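The AUC used throughout these comparisons can be illustrated with a minimal sketch. It uses the rank (Mann-Whitney) formulation, in which the AUC is the fraction of positive/negative pairs that the score orders correctly. The function name `roc_auc` and the tie handling (a tie counts as half) are assumptions made for illustration and are not taken from the paper's own evaluation code.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise ranking.

    labels: iterable of 0/1 class labels.
    scores: iterable of floats, where a higher score means "more positive".
    A tie between a positive and a negative counts as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

The quadratic pairwise loop is chosen for readability; for large evaluation sets a sort-based implementation or an established library routine would be the more practical choice.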
Supplementary Figure 3 In silico saturated mutagenesis analysis for identifying predictive sequence features
Predictive sequence features can be identified by computationally mutating each base and analyzing the effect on binding probability. Each column in a heatmap represents a base position in the sequence. The three rows represent the three possible base substitutions following A>G>C>T order from bottom to top. For example, if the original sequence has base G, then the three rows represent C, T, A from bottom to top. The log2 fold change of odds (odds are computed from probability as P/(1 − P)) is shown with the heatmap; yellow indicates an increase in binding and blue indicates a decrease in binding. Each sequence example is shown in two panels. The first (top) panel shows the “mutation scanning” results on the whole 1000bp sequence. The second (bottom) panel focuses on the center 200bp in order to show the actual nucleotide sequences. Many sequence elements identified are consistent with canonical motifs such as TTGCTCAA for CEBPB, TGATAA for GATA1, GTAAATA for FOXA1 and GTACATA for FOXA2. The four example sequences shown in this figure are centered around SNPs chr1:109817590 G>T, chr16:209709 T>C, chr10:23508363 A>G, chr16:52599188 C>T respectively.
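The mutation-scanning procedure in this caption can be sketched as follows. The code substitutes every base with each of the three alternatives, scores the result, and reports the log2 fold change of odds against the unmutated sequence. The `predict` argument stands in for a trained model, and `toy_predict`, `substitution_rows`, and the clipping constant `eps` are hypothetical helpers introduced only for this illustration; none of them is part of DeepSEA.

```python
import math

BASES = "AGCT"  # cyclic A>G>C>T order used for the heatmap rows


def substitution_rows(ref_base):
    """Alternative bases for ref_base, listed bottom to top in A>G>C>T order."""
    start = BASES.index(ref_base)
    return [BASES[(start + k) % 4] for k in (1, 2, 3)]


def log2_odds_fold_change(p_ref, p_alt, eps=1e-7):
    """log2 of odds(alt) / odds(ref), with odds = P / (1 - P).

    Probabilities are clipped to [eps, 1 - eps] to avoid division by zero.
    """
    p_ref = min(max(p_ref, eps), 1 - eps)
    p_alt = min(max(p_alt, eps), 1 - eps)
    return math.log2((p_alt / (1 - p_alt)) / (p_ref / (1 - p_ref)))


def saturated_mutagenesis(seq, predict):
    """Score every single-base substitution of seq.

    Returns a dict mapping (position, alt_base) to the log2 fold change of
    odds relative to the prediction for the unmutated sequence.
    """
    p_ref = predict(seq)
    effects = {}
    for i, ref_base in enumerate(seq):
        for alt in substitution_rows(ref_base):
            mutated = seq[:i] + alt + seq[i + 1:]
            effects[(i, alt)] = log2_odds_fold_change(p_ref, predict(mutated))
    return effects


def toy_predict(seq):
    """Hypothetical stand-in for a model: high probability iff TGATAA occurs."""
    return 0.9 if "TGATAA" in seq else 0.1


effects = saturated_mutagenesis("CCTGATAACC", toy_predict)
```

With the toy predictor, substitutions inside the TGATAA motif yield strongly negative values (a decrease in binding), while substitutions in the flanking bases yield zero, which is the pattern a heatmap of this kind is meant to expose.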
Supplementary Figure 4 DeepSEA accurately predicted histone QTL effects
DeepSEA histone mark classifiers accurately predicted allele-specific effects on the histone marks H3K4me3 and H3K27ac, that is, which allele carries more of the histone mark. The top prediction accuracies are over 0.9 for both marks. The predictions were evaluated with histone mark QTLs identified with FDR < 0.1 in Yoruba lymphoblastoid cell lines (ref. 1). The margin shown on the x axis is the threshold on the predicted probability difference between the two alleles used to select high-confidence predictions. Performance is measured by the accuracy of the predictions above the threshold (y axis).
1. McVicker, G. et al. Science 342, 747–749 (2013).
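The margin-based evaluation described in this caption can be sketched as below: only variants whose predicted probabilities differ between the two alleles by more than the margin are scored, and accuracy is computed over that subset. The function name, the argument layout, and the strict "greater than" comparison are assumptions for illustration; the caption does not specify them.

```python
def accuracy_above_margin(p_allele1, p_allele2, observed_higher, margin):
    """Accuracy over variants whose predicted allelic difference exceeds margin.

    p_allele1, p_allele2: predicted probabilities for the two alleles.
    observed_higher: 1 or 2 per variant, the allele with more histone mark.
    Returns (accuracy, n_called); accuracy is None when no variant passes.
    """
    correct = 0
    called = 0
    for p1, p2, obs in zip(p_allele1, p_allele2, observed_higher):
        if abs(p1 - p2) <= margin:
            continue  # low-confidence prediction, not scored
        called += 1
        predicted = 1 if p1 > p2 else 2
        if predicted == obs:
            correct += 1
    accuracy = correct / called if called else None
    return accuracy, called
```

Raising the margin trades coverage for confidence: fewer variants are called, but the remaining calls are the ones with the largest predicted allelic differences.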
Supplementary Figure 5 Flow diagram for DeepSEA functional SNP prioritization
For each input variant, DeepSEA computes 1842 features, including 1838 predicted chromatin effect features and 4 evolutionary conservation features. Predicted chromatin effect features include the absolute difference and the relative difference between the predicted probabilities of the reference and alternative sequences, for each TF / DNase / histone chromatin feature. Evolutionary conservation scores based on multi-species genome alignments were retrieved for the variant positions. The absolute value of each feature is taken, and the result is scaled to mean 0 and variance 1 before being provided as input to the classifier.
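The feature pipeline in this flow diagram can be sketched under some stated assumptions. The caption gives 1838 chromatin effect features built from two difference measures per chromatin profile, which implies 919 profiles. Here the relative difference is assumed to be a difference of log odds; the caption does not define it, so treat that choice, the clipping constant `eps`, and all function names as illustrative rather than as the authors' implementation.

```python
import math


def chromatin_effect_features(p_ref, p_alt, eps=1e-7):
    """Two features per chromatin profile, both as absolute values.

    abs_diff: |P(ref) - P(alt)|
    rel_diff: |logit(P(ref)) - logit(P(alt))|  (assumed definition)
    """
    feats = []
    for r, a in zip(p_ref, p_alt):
        r = min(max(r, eps), 1 - eps)
        a = min(max(a, eps), 1 - eps)
        abs_diff = abs(r - a)
        rel_diff = abs(math.log(r / (1 - r)) - math.log(a / (1 - a)))
        feats.extend([abs_diff, rel_diff])
    return feats


def build_feature_vector(p_ref, p_alt, conservation):
    """Chromatin effect features followed by absolute conservation scores."""
    return chromatin_effect_features(p_ref, p_alt) + [abs(c) for c in conservation]


def standardize_columns(rows):
    """Scale each column to mean 0 and variance 1 (population variance).

    Constant columns are left at 0 instead of dividing by zero.
    """
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    stds = [
        math.sqrt(sum((x - m) ** 2 for x in c) / n) or 1.0
        for c, m in zip(cols, means)
    ]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


# 919 profiles (inferred from 1838 / 2) plus 4 conservation scores.
vector = build_feature_vector([0.5] * 919, [0.4] * 919, [1.0, -2.0, 0.5, 0.0])
```

In practice the means and standard deviations would be estimated on the training variants and reused for new variants, so that the classifier sees consistently scaled inputs.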
Supplementary Figure 6 DeepSEA functional significance score prioritizes functional noncoding variants with high performance
The DeepSEA functional significance score measures the overall significance of predicted chromatin effects and evolutionary conservation scores. It is unsupervised and thus not biased toward any training set of functional variant annotations (see Online Methods). Notably, the DeepSEA functional significance score still surpassed the performance of previous methods even though no supervised training was used (compare to Fig. 3). The performance was measured by the area under the receiver operating characteristic curve (AUC). The x axis shows the average distance of each negative-variant group to the nearest positive variant. The “all” negative-variant groups are randomly selected negative 1000 Genomes SNPs.
Supplementary Figure 7 Dissecting DeepSEA functional SNP prioritization performance with subsets of input features
The performance of DeepSEA functional SNP prioritization models on HGMD regulatory mutations, noncoding eQTLs, and noncoding trait-associated (GWAS) SNPs was analyzed by comparison with models trained with only predicted chromatin effect features or only evolutionary conservation features. The performance was measured by the area under the receiver operating characteristic curve (AUC). The x axis shows the average distance of each negative-variant group to the nearest positive variant. The “all” negative-variant groups are randomly selected negative 1000 Genomes SNPs.
Supplementary Figure 8 DeepSEA-based classifier prioritized functionally annotated indels with high performance
HGMD regulatory indels prioritization performance was evaluated against negative 1000 Genomes indel groups with different distances to positive indels (average distance shown on the x-axis). The performance was measured by area under receiver operating characteristic (AUC). The prioritization model was trained with HGMD regulatory single nucleotide substitution mutations against 1200bp average distance negative variants.
Supplementary information
Supplementary Text and Figures
Supplementary Figures 1–8 and Supplementary Note (PDF 794 kb)
Supplementary Table 1
List of all publicly available chromatin feature profile files used for training DeepSEA (XLSX 22 kb)
Supplementary Table 2
DeepSEA prediction performance for each transcription factor, DNase I hypersensitive site, and histone mark profile (XLSX 79 kb)
Supplementary Table 3
Sequence based allele specific DNase I hypersensitivity predictions for allele imbalanced variants called from Digital Genomic Footprinting DNase-seq data (CSV 4745 kb)
Supplementary Table 4
Allele-imbalance DNase I hypersensitivity prediction performance for 35 cell types (XLSX 13 kb)
Supplementary Table 5
DeepSEA functional variant prioritization model predictions for noncoding GRASP eQTLs and negative variants sets (CSV 63482 kb)
Supplementary Table 6
DeepSEA functional variant prioritization model predictions for noncoding GWAS Catalog SNPs and negative variant sets. (CSV 56051 kb)
Supplementary Table 7
Feature rankings for noncoding functional variant prioritization tasks. (XLSX 841 kb)
Cite this article
Zhou, J., Troyanskaya, O. Predicting effects of noncoding variants with deep learning–based sequence model. Nat Methods 12, 931–934 (2015). https://doi.org/10.1038/nmeth.3547
DOI: https://doi.org/10.1038/nmeth.3547