Hadoop and Genomics - What You Need to Know - London - Viadex RCC - 2015.03.17

© 2015 MapR Technologies
Hadoop for Genomics: What you need to know

Target Application: Alleviate / Prevent Suffering
Diagram labels: DNA Sequencer; Reads; Reference Genome; Variant Calling; Genotype/Phenotype/Individual Matrix; Medical Records; Patient; Cure & Prevent Disease

Growth in Resource Capacity

Disruption Circa 2000
Chart: NASDAQ Composite

What Happened?
What did the winners do right to survive the dot-com recession?
Chart: NASDAQ Composite

Early 1990s: Early eCommerce Vendor Setup
Diagram labels: Storage; read/write (×2); Website; Back Office

Late 1990s: Workload became too big
Diagram labels: Storage; read/write (×2); Website (×4); Back Office (×2)

Google Publishes
• 2003: Google File System (aka GFS)
– http://research.google.com/archive/gfs.html
• 2004: MapReduce
– http://research.google.com/archive/mapreduce.html
• 2006: BigTable
– http://research.google.com/archive/bigtable.html

Scale-out with Google FS + MapReduce
Diagram labels: Storage + Compute Cluster; read/write (×2); Website (×4); Back Office (×2)

Apache Software Foundation: Fast Follower of Google
• MapReduce => Hadoop
• Google FS => Hadoop FS
• BigTable => HBase

DNA Sequencing, post-2004
Chart labels: DNA Sequence; NASDAQ Composite

DNA Sequencing, pre-2004
Diagram labels: Storage; write-only; read/write; High-Performance Compute Cluster; Coordinator / Edge Node; Sequencer

DNA Sequencing, post-2004
Diagram labels: Storage; write-only; read/write; High-Performance Compute Cluster; Coordinator / Edge Node; DNA Sequencer Cluster (e.g. Illumina X-Ten)
• HPC bottleneck
• Sequencer back-pressure

Solution: Implemented 2014 @ Sequencer Vendor (with MapR)
Diagram labels: write-only; DNA Sequencer Cluster (e.g. Illumina X-Ten); Storage + Compute Cluster; Decentralize I/O (×2)

Allows Secondary Analytics to Scale Out
Diagram labels: DNA Sequencer; Reads; Reference Genome; Variant Calling; Genotype/Phenotype/Individual Matrix; Medical Records; Patient; Cure & Prevent Disease

Allows Secondary Analytics to Scale Out
Diagram labels: GATK / HPC Analytics; Hadoop / Spark Analytics

Secondary Analytics: Acute Pain Point
Diagram labels: FastQ Reads; Aligned Reads; Variants; ADAM + Avocado
• Matrix rotation is very I/O intensive
• Local de novo is best, but only feasible with efficient rotations
Figure source: Zerbino & Birney, "Velvet: Algorithms for de novo short read assembly using de Bruijn graphs", 2008

Columnar Storage => Efficient Rotations
Genome Data Format Definition
Record 1: (A 1 Z)
Record 2: (B 1 Z)
Record 3: (C 1 Z)
Row-based: A 1 Z B 1 Z C 1 Z
Column-based: A B C 1 1 1 Z Z Z
Diagram labels: Sorting; Group; MLlib
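
The row-based and column-based layouts on this slide can be sketched in a few lines of Python. This is only an illustration of the idea, not the ADAM on-disk format; the function names `row_layout` and `column_layout` are made up for the example.

```python
# Toy records from the slide: three records with three fields each.
records = [("A", 1, "Z"), ("B", 1, "Z"), ("C", 1, "Z")]


def row_layout(rows):
    """Store records one after another: all fields of record 1, then record 2, ..."""
    return [value for row in rows for value in row]


def column_layout(rows):
    """Store fields one after another: field 1 of every record, then field 2, ..."""
    return [value for column in zip(*rows) for value in column]


row_based = row_layout(records)
col_based = column_layout(records)

print(row_based)  # ['A', 1, 'Z', 'B', 1, 'Z', 'C', 1, 'Z']
print(col_based)  # ['A', 'B', 'C', 1, 1, 1, 'Z', 'Z', 'Z']

# Reading one field for every record is a contiguous slice in the column layout,
# but a strided read (every third value) in the row layout.
first_field_from_columns = col_based[0:len(records)]
first_field_from_rows = row_based[0::3]
```

Both reads return the same values; the point of the columnar layout is that the first one touches a single contiguous range.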

Downstream Analytics: GWAS/PheWAS
Diagram labels: FastQ Reads; Aligned Reads; Variants; Function; Phenotypes; ADAM + Avocado
• Scalable GWAS/PheWAS: “Green Field” Territory

Target Application: Alleviate / Prevent Suffering
Diagram labels: DNA Sequencer; Reads; Reference Genome; Variant Calling; Genotype/Phenotype/Individual Matrix; Medical Records; Patient; Cure & Prevent Disease

GWAS Overview (Genome-wide Association Study)
• Which genome features are associated with phenotype X?
https://en.wikipedia.org/wiki/Genome-wide_association_study
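
As a rough illustration of "associated with", one common per-variant test compares how often a variant appears in cases versus controls using a 2x2 chi-square statistic. The sketch below assumes a simple case/control design; the counts are invented for the example and the function name `chi_square_2x2` is hypothetical, not part of any GWAS toolkit.

```python
def chi_square_2x2(case_with, case_without, control_with, control_without):
    """Pearson chi-square statistic (1 degree of freedom) for a 2x2 table.

    Rows are cases / controls, columns are carriers / non-carriers of one variant.
    """
    a, b, c, d = case_with, case_without, control_with, control_without
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # a row or column is empty, so no association can be measured
    return n * (a * d - b * c) ** 2 / denominator


# Invented counts: 30 of 100 cases carry the variant, 10 of 100 controls do.
associated = chi_square_2x2(30, 70, 10, 90)

# Invented counts: the variant is equally common in both groups.
not_associated = chi_square_2x2(10, 10, 10, 10)

print(associated, not_associated)
```

A genome-wide study repeats a test like this for every variant, which is why the multiple-testing burden and the I/O volume both grow with the number of variants.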

PheWAS Overview (Phenome-wide …)
• Which phenotypes are associated with genome variant X?
http://www.tcpinnovations.com/drugbaron/phewas-the-tool-thats-revolutionizing-drug-development-that-youve-likely-never-heard-of/

Genome × Phenome Analysis
For a given population, a given SNP 𝛿, and a given phenotype ϕ: the number of co-occurrences is the value of the matrix entry.
Matrix diagram: rows 𝛿1, 𝛿3, 𝛿5; columns ϕ1, ϕ3, ϕ5
• SPARSE: Billion+ Phenotypes
• SPARSE: Billion+ Genotypes
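
A minimal sketch of how such a sparse count matrix could be built, assuming each individual is described by a set of SNPs and a set of phenotypes. The record layout, the identifiers (`d1`, `p1`, ...) and the function name `cooccurrence_counts` are assumptions for illustration only.

```python
from collections import Counter


def cooccurrence_counts(population):
    """Count, for every (SNP, phenotype) pair, how many individuals have both.

    Only pairs that actually occur are stored, so the result is sparse.
    """
    counts = Counter()
    for person in population:
        for snp in person["snps"]:
            for phenotype in person["phenotypes"]:
                counts[(snp, phenotype)] += 1
    return counts


# Toy population with made-up SNP and phenotype identifiers.
population = [
    {"snps": {"d1", "d3"}, "phenotypes": {"p1"}},
    {"snps": {"d1"}, "phenotypes": {"p1", "p5"}},
    {"snps": {"d5"}, "phenotypes": {"p3"}},
]

counts = cooccurrence_counts(population)
print(counts[("d1", "p1")])  # two individuals have both d1 and p1
print(counts[("d5", "p1")])  # pair never observed, so the entry is implicitly zero
```

At the scale the slide describes, the same counting would be done as a distributed aggregation rather than in a single process.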

Disease Cause via Genome × Phenome Matrix Factorization
• Row Eigenvectors of X represent
– Archetype set of phenotypes
• Column Eigenvectors of Y represent
– Archetype set of genotypes
Matrix diagram: rows 𝛿1, 𝛿3, 𝛿5; columns ϕ1, ϕ3, ϕ5
• Principal Column Vector: Archetype Genotypes
• Principal Row Vector: Archetype Phenotypes
Sparse Matrix Package is Actively Developed in Spark Community
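
One way to picture the factorization is the leading singular triplet of the count matrix: the left singular vector weights the rows (genotypes) and the right singular vector weights the columns (phenotypes). The sketch below uses plain power iteration on a tiny dense matrix as a stand-in; it is not the Spark sparse-matrix package the slide refers to, and the matrix values and the function name `leading_singular_triplet` are invented.

```python
import math


def leading_singular_triplet(matrix, iterations=200):
    """Power iteration for the top singular value and its two singular vectors."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    # Start with equal weight on every column (phenotype).
    v = [1.0 / math.sqrt(n_cols)] * n_cols
    for _ in range(iterations):
        # u = M v: one weight per row (genotype).
        u = [sum(matrix[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        # w = M^T u: one weight per column (phenotype).
        w = [sum(matrix[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    u = [sum(matrix[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
    sigma = math.sqrt(sum(x * x for x in u))
    u = [x / sigma for x in u]
    return u, sigma, v


# Invented counts: genotypes 0 and 1 co-occur with phenotypes 0 and 1,
# while genotype 2 only co-occurs (weakly) with phenotype 2.
toy_counts = [
    [4, 4, 0],
    [4, 4, 0],
    [0, 0, 1],
]

archetype_genotype, strength, archetype_phenotype = leading_singular_triplet(toy_counts)
print(archetype_genotype, strength, archetype_phenotype)
```

In the toy matrix the dominant pattern groups the first two genotypes with the first two phenotypes, which is the kind of "archetype" pairing the slide describes.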

Scalable Variant Store => Root out Disease Causes
Model P ~ F(G)
Fortunately, this has already been done…
Diagram labels: Genotypes; Med Record; Phenotypes, e.g. disease risk, drug response

Largest Biometric Database in the World
1.2B PEOPLE

Why Create Aadhaar?
• India: 1.2 billion residents
– 640,000 villages, ~60% live under $2/day
– ~75% literacy, <3% pay income tax, <20% have bank accounts
– ~800 million mobile, ~200-300 million migrant workers
• Govt. spends about $25-40 billion on direct subsidies
– Residents have no standard identity document
– Most programs plagued with ghost and multiple identities causing leakage of 30-40%
Standardize identity => Stop leakage

Aadhaar Biometric Capture & Index
Diagram labels: Raw; Digital; Fingerprint

Aadhaar Biometric ID Creation
F(x): unique features
G(x): uncommon features
H(x): other features
• 900MM people loaded in 4 years
• In production
– 1MM registrations/day
– 200+ trillion lookups/day
• All built on MapR-DB (HBase)
Diagram labels: Low Entropy + Unique; Low Entropy + Infrequent

How Does this Relate to Genomics?
F⁻¹(x): common features
Same data shape and size
• Aadhaar: 1B humans, 5MB minutiae
• Genome: 7B humans, ~3M variants

How Does this Relate to Genomics?
F⁻¹(x): common features
Phenotype: healthy or sick?
Phenotype Partition => Low Entropy

Takeaway 1: Don’t reinvent the wheel
• individuals, fingerprint minutiae: Find rare minutiae to uniquely identify
≈
• medical records, genetic variants: Find shared variants to get disease root cause
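
The "find shared variants" side of the comparison can be sketched with set operations: partition individuals by phenotype, then keep the variants carried by everyone in the sick group and by no one outside it. This is a deliberately naive illustration; the data, the identifiers and the function name `shared_variants` are hypothetical, and real studies would use statistical association rather than strict set membership.

```python
def shared_variants(variants_by_person, phenotype_by_person, target="sick"):
    """Variants carried by every individual with the target phenotype and by nobody else."""
    in_group = [
        set(variants)
        for person, variants in variants_by_person.items()
        if phenotype_by_person[person] == target
    ]
    out_group = [
        set(variants)
        for person, variants in variants_by_person.items()
        if phenotype_by_person[person] != target
    ]
    if not in_group:
        return set()
    common = set.intersection(*in_group)
    seen_elsewhere = set().union(*out_group)
    return common - seen_elsewhere


# Made-up individuals, variants and phenotype labels.
variants_by_person = {
    "person1": {"v1", "v2", "v3"},
    "person2": {"v2", "v3", "v4"},
    "person3": {"v3", "v5"},
}
phenotype_by_person = {"person1": "sick", "person2": "sick", "person3": "healthy"}

candidates = shared_variants(variants_by_person, phenotype_by_person)
print(candidates)  # {'v2'}
```

The biometric case is the mirror image: there the goal is features that are rare across the population, here it is features that are common within a phenotype partition.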

Takeaway 2: Evolution, not Revolution
Chart labels: DNA Sequence; NASDAQ Composite

Thank You
