A general framework for predicting the transcriptomic consequences of non-coding variation and small molecules

Moustafa Abdalla; Mohamed Abdalla

doi:10.5281/zenodo.6400074

Published March 30, 2022 | Version v1

Resource type: Software | Access: Open

1. University of Oxford
2. University of Toronto

Summary:

Code, Python notebooks, and data accompanying the manuscript 'A general framework for predicting the transcriptomic consequences of non-coding variation and small molecules' (published in PLoS Computational Biology; doi: 10.1371/journal.pcbi.1010028).

Stage 1 peaBrain:

In Stage 1, we constructed a single model to predict the mean abundance of all genes in any given tissue from the reference genome, optionally supplemented with epigenetic and genomic annotations. We applied this framework to all tissues in the GTEx dataset, constructing three classes of models: (a) using DNA sequence alone (class-A); (b) using DNA sequence plus epigenomic annotations not specific to any tissue or cell type (i.e. non-specific annotations) (class-B); and (c) using DNA sequence combined with both non-specific and tissue-specific annotations (class-C). We have provided all code and data necessary to generate the results for class-A and class-B models. Due to storage constraints, we provide training/test data only for skeletal muscle; expression data for other tissues are available from GTEx. The original data sources used to train class-C models are detailed in the manuscript.
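The difference between class-A and class-B inputs can be illustrated with a minimal sketch. It assumes a one-hot encoding of the sequence with annotation tracks appended as extra per-position columns; this record does not state how peaBrain encodes its inputs, so the function names (`one_hot`, `build_input`) and the array layout are hypothetical and shown only for illustration.

```python
import numpy as np

BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a (len(seq), 4) array.

    Bases outside A/C/G/T (for example N) become all-zero rows.
    """
    index = {base: i for i, base in enumerate(BASES)}
    out = np.zeros((len(seq), 4), dtype=np.float32)
    for pos, base in enumerate(seq.upper()):
        col = index.get(base)
        if col is not None:
            out[pos, col] = 1.0
    return out


def build_input(seq, annotations=None):
    """Return sequence-only features, or sequence plus annotation columns.

    `annotations` is assumed to be a (len(seq), n_tracks) array with one
    value per position and track (an illustrative layout, not the
    authors' documented format).
    """
    encoded = one_hot(seq)
    if annotations is None:
        return encoded  # sequence alone, analogous to class-A
    annotations = np.asarray(annotations, dtype=np.float32)
    if annotations.ndim != 2 or annotations.shape[0] != len(seq):
        raise ValueError("annotations must have shape (len(seq), n_tracks)")
    # sequence plus annotation tracks, analogous to class-B or class-C
    return np.concatenate([encoded, annotations], axis=1)
```

Under this layout, class-B and class-C inputs would differ only in which annotation tracks are supplied.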

Using the Stage 1 class-B models, we generated a non-coding impact metric that captured the impact of each position in the core promoter sequence on the expression of each gene. The peaBrain impact scores for all GTEx tissues have been made available. In the manuscript, we show that this impact score correlates with nucleotide evolutionary constraint and is also predictive of disease-associated variation and allele-specific transcription factor binding. We also highlight how tissue-specific peaBrain scores can be leveraged to pinpoint functional tissues underlying complex traits, outperforming methods that depend on colocalization of eQTL and GWAS signals.
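One generic way to obtain a per-position impact score from any sequence-to-expression model is in silico mutagenesis: substitute each base in turn and record how far the prediction moves. The sketch below illustrates that idea only; it is not necessarily how the peaBrain score is computed (the manuscript describes the actual method), and both `impact_scores` and the stand-in model `toy_predict` are hypothetical.

```python
import numpy as np

BASES = "ACGT"


def impact_scores(seq, predict):
    """Score each position by the largest absolute change in the
    prediction over all single-base substitutions at that position.

    `predict` is any callable mapping a DNA string to a float.
    """
    seq = seq.upper()
    baseline = predict(seq)
    scores = np.zeros(len(seq))
    for pos, ref in enumerate(seq):
        for alt in BASES:
            if alt == ref:
                continue
            mutated = seq[:pos] + alt + seq[pos + 1:]
            change = abs(predict(mutated) - baseline)
            scores[pos] = max(scores[pos], change)
    return scores


def toy_predict(seq):
    """Hypothetical stand-in for a trained model: counts G and C bases."""
    return float(seq.count("G") + seq.count("C"))


if __name__ == "__main__":
    print(impact_scores("ACGTTGCA", toy_predict))
```

With a real model, `predict` would wrap the trained network for one gene and tissue, and the loop would run over the core promoter sequence.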

Stage 2 peaBrain:

In Stage 2, we extended the peaBrain model to incorporate the transcriptomic consequences of individual genotype variation. In the manuscript, we describe the ability of this extended peaBrain model to predict the tissue-specific expression profile of each individual and to identify putatively functional variants within the sequence. Sample code has been provided. Individual-level data are available from GTEx.
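As a rough illustration of what individual genotype variation means at the sequence level, the sketch below applies single-nucleotide variants to a reference sequence to produce a personalised sequence. It is a deliberate simplification (0-based positions, substitutions only, no zygosity or indels), the helper `apply_variants` is hypothetical, and the provided Stage 2 sample code should be consulted for the encoding actually used.

```python
def apply_variants(reference, variants):
    """Return a copy of `reference` with single-nucleotide variants applied.

    `variants` is an iterable of (position, ref_base, alt_base) tuples with
    0-based positions. A ValueError is raised if the stated reference base
    does not match the sequence, which guards against coordinate mix-ups.
    """
    seq = list(reference.upper())
    for pos, ref, alt in variants:
        if seq[pos] != ref.upper():
            raise ValueError(
                f"reference mismatch at {pos}: expected {ref}, found {seq[pos]}"
            )
        seq[pos] = alt.upper()
    return "".join(seq)
```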

Files (988.4 MB)

Name	MD5	Size
Stage1.tar.gz	e494f1813859e878c5b3c46260a7e29d	987.7 MB
Stage1_classB_model_HTMLpythonnotebook.html	e1ad434edd5f59e0c016b6e45045d69d	328.1 kB
Stage2.tar.gz	fe5d7bdd0b3c906dbdcea0f9e87747cb	3.9 kB
Stage2_sample_HTMLpythonnotebook.html	3691f57761a0e7156c451a11752df354	355.4 kB
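After downloading, the files can be checked against the MD5 checksums listed above. The following is a minimal sketch using only the Python standard library; it assumes the files were saved under their original names in a single directory, and the helper names (`md5_of`, `verify`) are illustrative.

```python
import hashlib
from pathlib import Path

# File names and checksums as listed in this record.
EXPECTED_MD5 = {
    "Stage1.tar.gz": "e494f1813859e878c5b3c46260a7e29d",
    "Stage1_classB_model_HTMLpythonnotebook.html": "e1ad434edd5f59e0c016b6e45045d69d",
    "Stage2.tar.gz": "fe5d7bdd0b3c906dbdcea0f9e87747cb",
    "Stage2_sample_HTMLpythonnotebook.html": "3691f57761a0e7156c451a11752df354",
}


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory="."):
    """Map each expected file name to True if it exists and its MD5 matches."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of(path) == expected
    return results


if __name__ == "__main__":
    for name, ok in verify().items():
        print(name, "OK" if ok else "missing or checksum mismatch")
```

Reading in chunks keeps memory use low for the 987.7 MB Stage 1 archive.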

Additional details

Is supplement to: Journal article: 10.1371/journal.pcbi.1010028 (DOI)

