© 2015 MapR Technologies 1
Hadoop for Genomics: What you need to know
© 2015 MapR Technologies 2
Target Application: Alleviate / Prevent Suffering
[Diagram: Patient, DNA Sequencer, Reads, Reference Genome, Variant Calling, Medical Records, Genotype/Phenotype/Individual Matrix, Cure & Prevent Disease]
© 2015 MapR Technologies 3
Growth in Resource Capacity
© 2015 MapR Technologies 4
Disruption Circa 2000
[Chart: NASDAQ Composite]
© 2015 MapR Technologies 5
What Happened?
What did the winners do right to survive the dot-com recession?
[Chart: NASDAQ Composite]
© 2015 MapR Technologies 6
Early 1990s: Early eCommerce Vendor Setup
[Diagram: one Website and one Back Office, each with a read/write link to a single Storage]
© 2015 MapR Technologies 7
Late 1990s: Workload became too big
[Diagram: four Websites and two Back Offices, with read/write links to a single Storage]
© 2015 MapR Technologies 8
Google Publishes
• 2003: Google File System (aka GFS)
– http://research.google.com/archive/gfs.html
• 2004: MapReduce
– http://research.google.com/archive/mapreduce.html
• 2006: BigTable
– http://research.google.com/archive/bigtable.html
© 2015 MapR Technologies 9
Scale-out with Google FS + MapReduce
[Diagram: four Websites and two Back Offices, with read/write links to a Storage + Compute Cluster]
© 2015 MapR Technologies 10
Apache Software Foundation: Fast Follower of Google
• MapReduce => Hadoop
• Google FS => Hadoop FS
• BigTable => HBase
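The MapReduce model named above can be sketched in a few lines of plain Python. This is only an illustration of the map, shuffle, and reduce phases, not the Hadoop API; the function names and the toy reads are invented for the example, and k-mer counting is chosen simply because it fits the genomics theme.

```python
from collections import defaultdict

def map_read(read, k):
    """Map phase: emit a (k-mer, 1) pair for every k-mer in one read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k], 1

def shuffle(pairs):
    """Shuffle phase: group the emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_counts(key, values):
    """Reduce phase: sum the values collected for one key."""
    return key, sum(values)

def kmer_counts(reads, k):
    """Run the three phases in-process over a list of reads."""
    pairs = (pair for read in reads for pair in map_read(read, k))
    grouped = shuffle(pairs)
    return dict(reduce_counts(key, values) for key, values in grouped.items())

reads = ["ACGTAC", "CGTACG"]  # toy reads, invented for illustration
counts = kmer_counts(reads, 3)
```

On a cluster, each map call and each reduce call could run on a different node, which is what lets the workload scale out with the storage.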
© 2015 MapR Technologies 11
DNA Sequencing, post-2004
[Chart: DNA Sequence and NASDAQ Composite]
© 2015 MapR Technologies 12
DNA Sequencing, pre-2004
[Diagram: Sequencer, Coordinator / Edge Node, Storage, High-Performance Compute Cluster; links labeled write-only and read/write]
© 2015 MapR Technologies 13
DNA Sequencing, post-2004
[Diagram: DNA Sequencer Cluster (e.g. Illumina X-Ten), Coordinator / Edge Node, Storage, High-Performance Compute Cluster; links labeled write-only and read/write; callouts: HPC bottleneck, Sequencer back-pressure]
© 2015 MapR Technologies 14
Solution: Implemented 2014 @ Sequencer Vendor (with MapR)
[Diagram: DNA Sequencer Cluster (e.g. Illumina X-Ten) with a write-only link to a Storage + Compute Cluster; callout: Decentralize I/O]
© 2015 MapR Technologies 15
Allows Secondary Analytics to Scale Out
[Diagram: Patient, DNA Sequencer, Reads, Reference Genome, Variant Calling, Medical Records, Genotype/Phenotype/Individual Matrix, Cure & Prevent Disease]
© 2015 MapR Technologies 16
Allows Secondary Analytics to Scale Out
GATK / HPC Analytics
Hadoop / Spark Analytics
© 2015 MapR Technologies 17
Secondary Analytics: Acute Pain Point
[Diagram: FastQ Reads, Aligned Reads, Variants; ADAM + Avocado]
• Matrix rotation is very I/O intensive
• Local de novo is best… …only feasible with efficient rotations
Velvet: Algorithms for de novo short read assembly using de Bruijn graphs, Zerbino & Birney, 2008
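The Velvet paper cited on this slide assembles reads with de Bruijn graphs. As a rough illustration of the data structure only, the sketch below builds such a graph from toy reads: nodes are (k-1)-mers and each k-mer contributes one edge. It is not Velvet's implementation and ignores reverse complements, coverage, and error correction; the function name and reads are invented for the example.

```python
from collections import defaultdict

def de_bruijn_edges(reads, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])  # edge: prefix node -> suffix node
    return graph

# Two overlapping toy reads, invented for illustration.
graph = de_bruijn_edges(["ACGTA", "CGTAC"], 3)
```

Walking the edges of the graph reconstructs candidate sequences, which is why every read has to be broken up and regrouped by k-mer, an access pattern that is heavy on I/O.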
© 2015 MapR Technologies 18
Columnar Storage => Efficient Rotations
Genome Data Format Definition
Record 1: (A 1 Z)
Record 2: (B 1 Z)
Record 3: (C 1 Z)
Row-based: A 1 Z B 1 Z C 1 Z
Column-based: A B C 1 1 1 Z Z Z
[Diagram labels: Sorting, Group, MLlib]
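The row-based and column-based layouts on this slide can be reproduced with the same three records. The sketch below is a toy in-memory model, not a real columnar file format; the helper names are invented. It shows why reading one field is a strided scan in the row layout but a single contiguous slice in the column layout.

```python
# The three records from the slide.
records = [("A", 1, "Z"), ("B", 1, "Z"), ("C", 1, "Z")]

# Row-based layout: the fields of one record sit next to each other.
row_based = [field for record in records for field in record]

# Column-based layout: all values of one field sit next to each other.
columns = list(zip(*records))
col_based = [value for column in columns for value in column]

def read_column_from_rows(flat, index, width):
    """Row layout: picking one field means striding across every record."""
    return flat[index::width]

def read_column_from_cols(flat, index, n_records):
    """Column layout: one field is a single contiguous slice."""
    start = index * n_records
    return flat[start:start + n_records]

first_field_rows = read_column_from_rows(row_based, 0, 3)
first_field_cols = read_column_from_cols(col_based, 0, 3)
```

Both calls return the same values; the difference is that the column layout touches only the bytes it needs, which matters once the data lives on disk.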
© 2015 MapR Technologies 19
Downstream Analytics: GWAS/PheWAS
[Diagram: FastQ Reads, Aligned Reads, Variants, Function, Phenotypes; ADAM + Avocado]
Scalable GWAS/PheWAS: “Green Field” Territory
© 2015 MapR Technologies 20
Target Application: Alleviate / Prevent Suffering
[Diagram: Patient, DNA Sequencer, Reads, Reference Genome, Variant Calling, Medical Records, Genotype/Phenotype/Individual Matrix, Cure & Prevent Disease]
© 2015 MapR Technologies 21
GWAS Overview (Genome-wide Association Study)
• Which genome features are associated with phenotype X?
https://en.wikipedia.org/wiki/Genome-wide_association_study
© 2015 MapR Technologies 22
PheWAS Overview (Phenome-wide …)
• Which phenotypes are associated with genome variant X?
http://www.tcpinnovations.com/drugbaron/phewas-the-tool-thats-revolutionizing-drug-development-that-youve-likely-never-heard-of/
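The GWAS and PheWAS questions scan the same variant-by-phenotype data in opposite directions. The sketch below illustrates this with a plain Pearson chi-square statistic on a 2x2 table. It is a toy under stated assumptions: individuals are integer IDs, the function names and data are invented, and real studies use regression models, covariates, and multiple-testing correction that are omitted here.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # a degenerate table carries no association signal
    return n * (a * d - b * c) ** 2 / denom

def association(carriers, affected, population):
    """Chi-square for one (variant, phenotype) pair over a set of individual IDs."""
    a = len(carriers & affected)               # variant and phenotype
    b = len(carriers - affected)               # variant, no phenotype
    c = len(affected - carriers)               # phenotype, no variant
    d = len(population - carriers - affected)  # neither
    return chi_square_2x2(a, b, c, d)

def gwas(variant_carriers, affected, population):
    """GWAS direction: fix one phenotype, scan every variant."""
    return {v: association(c, affected, population)
            for v, c in variant_carriers.items()}

def phewas(carriers, phenotype_affected, population):
    """PheWAS direction: fix one variant, scan every phenotype."""
    return {p: association(carriers, a, population)
            for p, a in phenotype_affected.items()}

# Toy data, invented for illustration.
population = set(range(8))
variant_carriers = {"v1": {0, 1, 2, 3}, "v2": {0, 4}}
affected = {0, 1, 2, 3}
gwas_scores = gwas(variant_carriers, affected, population)
phewas_scores = phewas({0, 1, 2, 3}, {"p1": {0, 1, 2, 3}}, population)
```

Here `v1` is carried by exactly the affected individuals and scores high, while `v2` is spread evenly across both groups and scores zero.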
© 2015 MapR Technologies 23
Genome × Phenome Analysis
For a given population, a given SNP 𝛿, and a given phenotype ϕ: count the number of occurrences and use it as the value of the matrix entry.
[Matrix figure: one axis 𝛿1, 𝛿3, 𝛿5, labeled SPARSE Billion+ Genotypes; other axis ϕ1, ϕ3, ϕ5, labeled SPARSE Billion+ Phenotypes]
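The counting rule on this slide maps naturally onto a sparse structure that stores only the non-zero entries. The sketch below is a minimal in-memory version using a `Counter` keyed by (SNP, phenotype) pairs; the input shape (one pair of sets per individual) and the labels are assumptions made for the example, not something the slide specifies.

```python
from collections import Counter
from itertools import product

def genome_phenome_counts(individuals):
    """Count, per (SNP, phenotype) pair, how many individuals have both."""
    counts = Counter()
    for snps, phenotypes in individuals:
        # Every SNP carried by this individual co-occurs with each of
        # the individual's phenotypes exactly once.
        counts.update(product(snps, phenotypes))
    return counts

# Toy population, invented for illustration: (SNP set, phenotype set).
individuals = [
    ({"d1", "d3"}, {"p1"}),
    ({"d1"}, {"p1", "p5"}),
    ({"d5"}, set()),
]
matrix = genome_phenome_counts(individuals)
```

Pairs that never co-occur are simply absent and read back as zero, which is what keeps a matrix with billions of rows and columns storable.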
© 2015 MapR Technologies 24
Disease Cause via Genome × Phenome Matrix Factorization
• Row Eigenvectors of X represent
– Archetype set of phenotypes
• Column Eigenvectors of Y represent
– Archetype set of genotypes
[Matrix figure: axes 𝛿1, 𝛿3, 𝛿5 and ϕ1, ϕ3, ϕ5; Principal Column Vector: Archetype Genotypes; Principal Row Vector: Archetype Phenotypes]
Sparse Matrix Package is Actively Developed in Spark Community
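One way to read the two eigenvector bullets on this slide is as a singular value decomposition of the count matrix. The slide does not define X and Y, so the following is an interpretation rather than the author's stated method: it assumes a matrix M with rows indexed by genotypes and columns indexed by phenotypes, under which X would correspond to the product of M transposed with M, and Y to the product of M with M transposed.

```latex
% Assumption: M is the genotype-by-phenotype count matrix,
% rows indexed by variants \delta_i, columns by phenotypes \phi_j.
M = U \Sigma V^{\top}, \qquad M \in \mathbb{R}^{m \times n}

% Right singular vectors: eigenvectors of M^T M, living in phenotype space
% (the "archetype phenotypes").
M^{\top} M \, v_k = \sigma_k^{2} \, v_k

% Left singular vectors: eigenvectors of M M^T, living in genotype space
% (the "archetype genotypes").
M M^{\top} u_k = \sigma_k^{2} \, u_k

% Rank-r approximation built from the r largest singular values.
M \approx \sum_{k=1}^{r} \sigma_k \, u_k v_k^{\top}
```

Under this reading, each pair of singular vectors links one archetype genotype to one archetype phenotype, weighted by its singular value.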
© 2015 MapR Technologies 25
Scalable Variant Store => Root out Disease Causes
Model P ~ F(G)
[Diagram: Genotypes; Med Record; Phenotypes, e.g. disease risk, drug response]
Fortunately, this has already been done…
© 2015 MapR Technologies 26
Largest Biometric Database in the World
1.2B PEOPLE
© 2015 MapR Technologies 27
Why Create Aadhaar?
• India: 1.2 billion residents
– 640,000 villages, ~60% live under $2/day
– ~75% literacy, <3% pay income tax, <20% have bank accounts
– ~800 million mobile, ~200-300 million migrant workers
• Govt. spends about $25-40 billion on direct subsidies
– Residents have no standard identity document
– Most programs plagued with ghost and multiple identities, causing leakage of 30-40%
Standardize identity => Stop leakage
© 2015 MapR Technologies 28
Aadhaar Biometric Capture & Index
[Figure labels: Raw, Digital, Fingerprint]
© 2015 MapR Technologies 29
Aadhaar Biometric ID Creation
F(x): unique features
G(x): uncommon features
H(x): other features
• 900MM people loaded in 4 years
• In production
– 1MM registrations/day
– 200+ trillion lookups/day
• All built on MapR-DB (HBase)
[Callouts: Low Entropy + Unique; Low Entropy + Infrequent]
© 2015 MapR Technologies 30
How Does this Relate to Genomics?
F⁻¹(x): common features
F(x): unique features
G(x): uncommon features
H(x): other features
Same data shape and size
• Aadhaar: 1B humans, 5MB minutiae
• Genome: 7B humans, ~3M variants
© 2015 MapR Technologies 31
How Does this Relate to Genomics?
F⁻¹(x): common features
F(x): unique features
G(x): uncommon features
H(x): other features
Phenotype: healthy or sick?
Phenotype Partition => Low Entropy
© 2015 MapR Technologies 32
Takeaway 1: Don’t reinvent the wheel
[Diagram: individuals × fingerprint minutiae ≈ medical records × genetic variants]
• Find rare minutiae to uniquely identify
• Find shared variants to get disease root cause
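The "find shared variants" idea on this slide can be sketched as a frequency comparison between a sick and a healthy partition. This is a deliberately simple illustration, not the Aadhaar matching pipeline or any published method; the function names, the thresholds, and the toy variant sets are all hypothetical.

```python
from collections import Counter

def variant_frequencies(variant_sets):
    """Fraction of individuals in a group that carry each variant."""
    counts = Counter(v for variants in variant_sets for v in variants)
    total = len(variant_sets)
    return {v: n / total for v, n in counts.items()}

def candidate_variants(sick, healthy, min_sick=0.8, max_healthy=0.2):
    """Variants shared by most sick individuals and rare among healthy ones.

    The two thresholds are arbitrary illustration values.
    """
    sick_freq = variant_frequencies(sick)
    healthy_freq = variant_frequencies(healthy)
    return sorted(
        v for v, f in sick_freq.items()
        if f >= min_sick and healthy_freq.get(v, 0.0) <= max_healthy
    )

# Toy partitions, invented for illustration.
sick = [{"v1", "v2"}, {"v1", "v3"}, {"v1", "v2"}]
healthy = [{"v2"}, {"v3"}, {"v2", "v4"}]
candidates = candidate_variants(sick, healthy)
```

Fingerprint matching asks the opposite question, looking for features that are rare across the whole population, but it runs over data of the same shape.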
© 2015 MapR Technologies 33
Takeaway 2: Evolution, not Revolution
[Chart: DNA Sequence and NASDAQ Composite]
© 2015 MapR Technologies 34
Thank You

Hadoop and Genomics - What You Need to Know - London - Viadex RCC - 2015.03.17


Editor's Notes

  1. 26
  2. Increase GDP by 2%
  3. BOOM LSH