Predicting effects of noncoding variants with deep learning–based sequence model

Zhou, Jian; Troyanskaya, Olga G

doi:10.1038/nmeth.3547

Brief Communication
Published: 24 August 2015

Jian Zhou^1,2 &
Olga G Troyanskaya^1,3,4

Nature Methods volume 12, pages 931–934 (2015)

Abstract

Identifying functional effects of noncoding variants is a major challenge in human genetics. To predict the noncoding-variant effects de novo from sequence, we developed a deep learning–based algorithmic framework, DeepSEA (http://deepsea.princeton.edu/), that directly learns a regulatory sequence code from large-scale chromatin-profiling data, enabling prediction of chromatin effects of sequence alterations with single-nucleotide sensitivity. We further used this capability to improve prioritization of functional variants including expression quantitative trait loci (eQTLs) and disease-associated variants.

**Figure 2: The deep-learning model accurately predicts chromatin features from sequence with single-nucleotide sensitivity.**

**Figure 3: Sequence-based prioritization of functional noncoding variants.**

Acknowledgements

This work was primarily supported by US National Institutes of Health (NIH) grants R01 GM071966 and R01 HG005998 to O.G.T. This work was supported in part by the US National Science Foundation (NSF) CAREER award (DBI-0546275), NIH award T32 HG003284 and NIH grant P50 GM071508. O.G.T. is supported by the Genetic Networks program of the Canadian Institute for Advanced Research (CIFAR). We acknowledge the TIGRESS high-performance computer center at Princeton University for computational resource support. We are grateful to all Troyanskaya laboratory members for valuable discussions.

Author information

Authors and Affiliations

Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, USA
Jian Zhou & Olga G Troyanskaya
Graduate Program in Quantitative and Computational Biology, Princeton University, Princeton, New Jersey, USA
Jian Zhou
Department of Computer Science, Princeton University, Princeton, New Jersey, USA
Olga G Troyanskaya
Simons Center for Data Analysis, Simons Foundation, New York, New York, USA
Olga G Troyanskaya

Contributions

J.Z. designed the study, with input from O.G.T. J.Z. developed the method and analyzed the results. O.G.T. supervised the study. J.Z. and O.G.T. wrote the paper.

Corresponding author

Correspondence to Olga G Troyanskaya.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Integrated supplementary information

Supplementary Figure 1 Performance comparison of DeepSEA models trained with different context sequence lengths

DeepSEA models with the same architecture as described in the Online Methods were trained on 200bp, 500bp and 1000bp input sequences, and the AUCs of all chromatin features are shown as box plots. Although the chromatin feature labels were always determined from the central 200bp region, increasing the context sequence length significantly improved model performance (P-value < 2.2e-16 by Wilcoxon signed-rank test between any pair of models).
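The relationship between the fixed 200bp label region and the wider context input can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `context_window` and the 0-based, half-open coordinate convention are assumptions made for this sketch.

```python
def context_window(bin_start, context_len, bin_len=200):
    """Return (start, end) of a context window centered on a label bin.

    Coordinates are 0-based and half-open (an assumption of this sketch).
    The label bin [bin_start, bin_start + bin_len) stays fixed; only the
    flanking context grows with context_len.
    """
    if context_len < bin_len or (context_len - bin_len) % 2 != 0:
        raise ValueError("context_len must be >= bin_len and leave equal flanks")
    flank = (context_len - bin_len) // 2
    return bin_start - flank, bin_start + bin_len + flank


# The three context lengths compared in the figure, for one hypothetical bin.
for length in (200, 500, 1000):
    print(length, context_window(10_000, length))
```

In every case the labels still refer to the same central 200bp; the longer inputs only add flanking sequence on both sides.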

Supplementary Figure 2 Performance comparison with gkm-SVM

The deep convolutional network model outperformed the gapped k-mer SVM (gkm-SVM) on transcription factor binding prediction. The deep convolutional network achieved a higher area under the receiver operating characteristic curve (AUC) for almost all transcription factors (left panel). The gapped k-mer SVM did not gain performance from increasing the size of the context sequences (right panel).
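For readers unfamiliar with the metric, AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition; the function name `roc_auc` and the toy labels and scores are illustrative assumptions, not data from the paper.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative count as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: 3 of the 4 positive/negative pairs are ordered correctly.
print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

The pairwise loop is quadratic and only suitable for small illustrations; a rank-based implementation would be preferable for genome-scale evaluation.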

Supplementary Figure 3 In silico saturated mutagenesis analysis for identifying predictive sequence features

Predictive sequence features can be identified by computationally mutating each base and analyzing the effect on binding probability. Each column in a heatmap represents a base position in the sequence. The three rows represent the three possible base substitutions following A>G>C>T order from bottom to top. For example, if the original sequence has base G, then the three rows represent C, T, A from bottom to top. The log2 fold change of odds (odds are computed from probability as P/(1 − P)) is shown with the heatmap; yellow indicates an increase of binding and blue indicates a decrease of binding. Each sequence example is shown by two panels. The first (top) panel shows the "mutation scanning" results on the whole 1000bp sequence. The second (bottom) panel focuses on the center 200bp in order to show the actual nucleotide sequences. Many of the sequence elements identified are consistent with canonical motifs such as TTGCTCAA for CEBPB, TGATAA for GATA1, GTAAATA for FOXA1 and GTACATA for FOXA2. The four example sequences shown in this figure are centered around SNPs chr1:109817590 G>T, chr16:209709 T>C, chr10:23508363 A>G and chr16:52599188 C>T, respectively.
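The mutation-scanning procedure described in this legend can be sketched as below. This is a minimal illustration under stated assumptions: `toy_predict` is a hypothetical stand-in for a trained model, the helper names are invented for this sketch, and the cyclic reading of the A>G>C>T row order is inferred from the legend's own example (G gives C, T, A).

```python
import math

ORDER = "AGCT"  # A > G > C > T, read cyclically starting after the reference base


def substitutions(ref_base):
    """Three alternative bases for one position, bottom heatmap row first."""
    idx = ORDER.index(ref_base)
    return [ORDER[(idx + k) % 4] for k in (1, 2, 3)]


def log2_odds_fold_change(p_ref, p_mut):
    """log2 of odds(mutant) / odds(reference), with odds = P / (1 - P)."""
    odds_ref = p_ref / (1.0 - p_ref)
    odds_mut = p_mut / (1.0 - p_mut)
    return math.log2(odds_mut / odds_ref)


def saturated_mutagenesis(seq, predict):
    """Score every single-base substitution of seq.

    predict maps a sequence to a probability strictly between 0 and 1.
    Returns {(position, alt_base): log2 fold change of odds}.
    """
    p_ref = predict(seq)
    effects = {}
    for i, ref_base in enumerate(seq):
        for alt in substitutions(ref_base):
            mutated = seq[:i] + alt + seq[i + 1:]
            effects[(i, alt)] = log2_odds_fold_change(p_ref, predict(mutated))
    return effects


def toy_predict(seq):
    # Hypothetical predictor: high binding probability only if the motif is intact.
    return 0.8 if "TGATAA" in seq else 0.2


effects = saturated_mutagenesis("CCTGATAACC", toy_predict)
print(substitutions("G"))
print(len(effects), round(effects[(2, "G")], 6))
```

Negative values correspond to the blue cells of the heatmap (decreased binding) and positive values to the yellow cells (increased binding).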

Supplementary Figure 4 DeepSEA accurately predicted histone QTL effects

DeepSEA histone mark classifiers accurately predicted allele-specific effects on the histone marks H3K4me3 and H3K27ac, that is, which allele carries more of the histone mark. The top prediction accuracies are over 0.9 for both marks. The predictions were evaluated with histone mark QTLs identified with FDR < 0.1 in Yoruba lymphoblastoid cell lines¹. The margin shown on the x axis is the threshold on the predicted probability difference between the two alleles used to select high-confidence predictions. Performance is measured by the accuracy of the predictions above the threshold (y axis).
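The margin-based evaluation can be illustrated with a short sketch. The function name `accuracy_above_margin`, the strict "greater than" comparison at the threshold and the toy numbers are assumptions of this example, not details taken from the paper.

```python
def accuracy_above_margin(p_allele1, p_allele2, allele1_higher, margin):
    """Accuracy over variants whose predicted allele difference exceeds margin.

    p_allele1, p_allele2: predicted probabilities for the two alleles.
    allele1_higher: observed truth, True if allele 1 carries more of the mark.
    Returns (accuracy, number of variants kept), or (None, 0) if none pass.
    """
    correct = kept = 0
    for p1, p2, truth in zip(p_allele1, p_allele2, allele1_higher):
        if abs(p1 - p2) <= margin:
            continue  # low-confidence prediction, excluded at this margin
        kept += 1
        correct += (p1 > p2) == truth
    return (correct / kept if kept else None), kept


# Toy predictions for four hypothetical QTL variants.
p1 = [0.9, 0.52, 0.3, 0.6]
p2 = [0.4, 0.50, 0.7, 0.55]
truth = [True, False, False, False]
for m in (0.0, 0.1):
    print(m, accuracy_above_margin(p1, p2, truth, m))
```

Raising the margin keeps fewer variants but, as in the figure, the retained predictions tend to be more accurate.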

1. McVicker, G. et al. Science 342, 747–749 (2013).

Supplementary Figure 5 Flow diagram for DeepSEA functional SNP prioritization

For each input variant, DeepSEA computes 1842 features, including 1838 predicted chromatin effect features and 4 evolutionary conservation features. Predicted chromatin effect features include the absolute difference and the relative difference between the predicted probabilities for the reference and alternative sequences, computed for each TF / DNase / histone chromatin feature. Evolutionary conservation scores based on multi-species genome alignments were retrieved for the variant positions. The absolute value of each feature is taken, and the features are then scaled to mean 0 and variance 1 before being provided as input to the classifier.
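The feature construction in this flow diagram can be sketched as follows. Two features per chromatin profile giving 1838 features implies 919 profiles (1838 / 2), and 1838 + 4 = 1842. Writing the relative difference as a difference of log odds is an assumption of this sketch, as are all function names and the use of the population variance for scaling.

```python
import math
from statistics import mean, pstdev


def logit(p):
    """Log odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def chromatin_effect_features(p_ref, p_alt):
    """Absolute and relative difference for each chromatin profile.

    The relative difference is taken here as |logit(ref) - logit(alt)|,
    which is an assumption of this sketch.
    """
    abs_diff = [abs(r - a) for r, a in zip(p_ref, p_alt)]
    rel_diff = [abs(logit(r) - logit(a)) for r, a in zip(p_ref, p_alt)]
    return abs_diff + rel_diff


def variant_feature_vector(p_ref, p_alt, conservation):
    """Chromatin effect features followed by absolute conservation scores."""
    return chromatin_effect_features(p_ref, p_alt) + [abs(c) for c in conservation]


def standardize_columns(rows):
    """Scale each feature column to mean 0 and variance 1 across variants."""
    scaled = []
    for col in zip(*rows):
        mu, sd = mean(col), pstdev(col)
        scaled.append([(x - mu) / sd if sd > 0 else 0.0 for x in col])
    return [list(row) for row in zip(*scaled)]


# One hypothetical variant with 919 chromatin profiles and 4 conservation scores.
vec = variant_feature_vector([0.5] * 919, [0.6] * 919, [0.1, -0.2, 0.3, 0.4])
print(len(vec))
```

Standardization is applied column-wise across all variants, so it needs the full feature matrix rather than a single vector.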

Supplementary Figure 6 DeepSEA functional significance score prioritizes functional noncoding variants with high performance

The DeepSEA functional significance score measures the overall significance of predicted chromatin effects and evolutionary conservation scores. It is unsupervised and therefore not biased toward any training set of functional variant annotations (see Online Methods). Notably, the DeepSEA functional significance score still surpassed the performance of previous methods even though no supervised training was used (compare to Fig. 3). Performance was measured by the area under the receiver operating characteristic curve (AUC). The x axis shows the average distance of each negative-variant group to the nearest positive variant. The "all" negative-variant groups are randomly selected negative 1000 Genomes SNPs.

Supplementary Figure 7 Dissecting DeepSEA functional SNP prioritization performance with subsets of input features

The performance of DeepSEA functional SNP prioritization models on HGMD regulatory mutations, noncoding eQTLs and noncoding trait-associated (GWAS) SNPs was analyzed by comparison with models trained with only predicted chromatin effect features or only evolutionary conservation features. Performance was measured by the area under the receiver operating characteristic curve (AUC). The x axis shows the average distance of each negative-variant group to the nearest positive variant. The "all" negative-variant groups are randomly selected negative 1000 Genomes SNPs.

Supplementary Figure 8 DeepSEA-based classifier prioritized functionally annotated indels with high performance

Prioritization performance for HGMD regulatory indels was evaluated against negative 1000 Genomes indel groups at different distances from positive indels (average distance shown on the x axis). Performance was measured by the area under the receiver operating characteristic curve (AUC). The prioritization model was trained with HGMD regulatory single-nucleotide substitution mutations against negative variants with a 1200bp average distance.

About this article

Cite this article

Zhou, J., Troyanskaya, O. Predicting effects of noncoding variants with deep learning–based sequence model. Nat Methods 12, 931–934 (2015). https://doi.org/10.1038/nmeth.3547

Received: 26 February 2015
Accepted: 11 June 2015
Published: 24 August 2015
Issue Date: October 2015
DOI: https://doi.org/10.1038/nmeth.3547