PCA-Ethnicity-Determination-from-WGS-Data

A pipeline that uses 1000 Genomes data and WGS data from your own samples to determine or validate the ethnicity of an individual.

The goal of this pipeline is to determine the ancestry of an individual from sequencing data (SNPs), starting with hg38 variant call format (VCF) files from those individuals. The cohort data is combined with 1000 Genomes data, and principal component analysis (PCA) is performed on the combined set. The PCA scores of your samples are then plotted together with those of the 1000 Genomes samples, giving a visual representation of where each individual falls on the overall PCA plot of ancestry.
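The combine-then-PCA flow described above can be sketched roughly as follows. This is a minimal illustration, not the contents of 1-determine-ancestry-by-PCA.sh: the function names, file names, and the choice of 10 components are assumptions, and a real run would typically also restrict both inputs to shared sites before merging. Commands are only printed unless DRY_RUN=0 is set.

```shell
#!/usr/bin/env bash
# Sketch of the merge-then-PCA flow. All file names are hypothetical placeholders.
# By default the commands are only printed; set DRY_RUN=0 to execute them.

DRY_RUN="${DRY_RUN:-1}"

run() {
  # Print the command, and execute it only when DRY_RUN=0.
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

pca_pipeline() {
  local cohort_vcf="$1"   # bgzipped, indexed hg38 VCF of your own samples
  local ref_vcf="$2"      # bgzipped, indexed 1000 Genomes hg38 VCF
  local out_prefix="$3"   # prefix for all output files

  # Combine cohort and reference samples into one multi-sample VCF.
  run bcftools merge -Oz -o "${out_prefix}.merged.vcf.gz" "$cohort_vcf" "$ref_vcf"

  # Keep biallelic variants only (an assumption of this sketch) and
  # compute the first 10 principal components on the merged genotypes.
  run plink2 --vcf "${out_prefix}.merged.vcf.gz" --max-alleles 2 --pca 10 --out "${out_prefix}"
}
```

Calling `pca_pipeline cohort.vcf.gz 1000G.vcf.gz run1` prints the two commands. With DRY_RUN=0, plink2 writes the per-sample scores to run1.eigenvec, which is the kind of table a plotting step would read.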

Some requirements for this pipeline:

Filtered VCF files from your own samples (hg38)
bcftools
plink2
vcftools
R and RStudio (for plotting)
1000 Genomes data files: http://hgdownload.soe.ucsc.edu/gbdb/hg38/1000Genomes/
If processing many samples: a high-performance computing cluster
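Before starting, it can help to confirm that the command-line tools listed above are on your PATH. The helper below is a small sketch; `check_tools` is a hypothetical name, and checking for `Rscript` assumes you want to run R from the command line rather than only from RStudio.

```shell
check_tools() {
  # Report every tool that is not on PATH; return non-zero if any is missing.
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Tools named in the requirements list above.
check_tools bcftools plink2 vcftools Rscript \
  || echo "install the missing tools before running the pipeline" >&2
```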

Instructions:

Perform the steps outlined in the bash script 1-determine-ancestry-by-PCA.sh
In R, perform the steps outlined in 2-plot.R

The output of this ancestry-calling pipeline is a plot of the 1000 Genomes super populations, with your own samples overlaid on the super population they most closely resemble based on the SNV data.
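The plot is read by eye, but "most closely resemble" can also be checked numerically. The sketch below is not part of the author's scripts: it assumes a hypothetical whitespace-separated table with no header and the columns id, label, PC1, PC2, where label is a super population code for 1000 Genomes samples and NA for your own samples. Each of your samples is assigned to the super population whose centroid is nearest in the PC1/PC2 plane.

```shell
assign_superpop() {
  # Input columns (hypothetical layout, no header): id label PC1 PC2
  # label is a super population code for reference samples, "NA" for your own.
  awk '
    # Reference samples: accumulate per-population sums for the centroids.
    $2 != "NA" { n[$2]++; sx[$2] += $3; sy[$2] += $4; next }
    # Your own samples: remember them in input order.
    { qid[++q] = $1; qx[q] = $3; qy[q] = $4 }
    END {
      for (i = 1; i <= q; i++) {
        best = ""; bestd = -1
        for (p in n) {
          dx = qx[i] - sx[p] / n[p]
          dy = qy[i] - sy[p] / n[p]
          d = dx * dx + dy * dy          # squared Euclidean distance
          if (bestd < 0 || d < bestd) { bestd = d; best = p }
        }
        print qid[i], best
      }
    }
  ' "$1"
}
```

Running `assign_superpop scores.txt` prints one line per query sample with its nearest super population. A nearest-centroid rule over two components is a deliberately simple choice; samples that fall between clusters, for example admixed individuals, still need to be judged on the plot.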

laura-budurlean/PCA-Ethnicity-Determination-from-WGS-Data
