© 2014 MapR Technologies 1 
“…all men are created equal…”
Star Wars V: The Empire Strikes Back
Dahm Triplets, Playboy 1998
All Men Are Created Equal 
• Data Scientist: 
“depends on the distance metric” 
D_Law(x_a, x_b)
D_Genetic(x_a, x_b)
D_Genetic(x, x) = ???
Copenhagen, Denmark, 1911
An Unusual Observation
• Wilhelm Johannsen, 1911, Copenhagen
– Coins the word “gene”
• Discovery: sizes are not identical for genetically identical peas.
– D_Genetic(x, x) != 0. WTF?
• Introduces new concept:
– hidden and causal genotype, distinct from
– observed phenotype
• Further, size follows a Gaussian distribution, where
P ~ G + E
Var(P) ~ Var(G) + Var(E) + 2 Cov(G, E)
• This is the basis of quantitative genetics
The Genotype Conception of Heredity. Johannsen, 1911
https://www.youtube.com/watch?v=AjI_LcQOOs4
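The variance decomposition can be checked numerically. The sketch below is illustrative only: the G and E values are made up, and it assumes the additive model is taken literally (P = G + E), in which case the identity holds exactly for population variance and covariance. The helper `pcovariance` is defined here, not taken from the talk.

```python
from statistics import fmean, pvariance


def pcovariance(xs, ys):
    """Population covariance of two equal-length sequences."""
    mx, my = fmean(xs), fmean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


# Hypothetical genetic (G) and environmental (E) contributions to pea size.
G = [10.0, 10.0, 12.0, 12.0, 14.0, 14.0]
E = [-0.5, 0.7, -0.2, 0.4, 0.1, -0.3]
P = [g + e for g, e in zip(G, E)]  # phenotype under the additive model

lhs = pvariance(P)
rhs = pvariance(G) + pvariance(E) + 2 * pcovariance(G, E)
print(f"Var(P) = {lhs:.6f}, Var(G) + Var(E) + 2 Cov(G, E) = {rhs:.6f}")
```

Genetically identical peas correspond to Var(G) = 0, so any remaining spread in size comes from Var(E).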
Star Wars: Episode VII
Star Wars V: The Empire Strikes Back
Renaissance In Medicine: Next Generation BigData Workloads
The Big Lebowski
1st Renaissance In Medicine: 1400-1700, Europe
Enabling Factor
• Movable Type (1450, Gutenberg)
• Weak Church
• Compound Microscope (1624, Galileo)
• Math-driven Hypotheses (1687, Newton’s H0)
Effect
• Rapid diffusion of ideas
• New data sources (human dissection)
• Dense, precise metrics (diagrams)
• Paradigm shift in discovery
2nd Renaissance In Medicine: 1900-
Enabling Factor
• Telecom Networks (1969, ARPANET. 2008)
• Globalization
• Next-Gen DNA Sequencer (1992, Lynx Therapeutics)
• Data-driven Hypotheses (4th scientific paradigm)
Effect
• Rapid diffusion of ideas
• New data sources (GMOs, stem cells)
• Dense, precise metrics (human genomes)
• Paradigm shift in discovery
DNA Sequencer – Dense, Precise Metrics
Current Generation “Next-Generation” Sequencer
• Jan 2014 – launch of Illumina XTen @ $10M (cap cost)
• Produces 6T basepairs / day @ $20K (ops cost)
Here’s how it works…
A few other facts…
★ 300G basepairs / human @ $1000 (medical grade genome)
★ 4M births / year (2012, United States)
Even Moore’s Law
[Figure: Storage (MB/$) vs. DNA (bp/$). “Even Moore’s Law” begins in 2004; ILMN HiSeq XTen (Jan 2014) marks the $1000 Genome.]
Stein. 2010. The case for cloud computing in genome informatics
Impact of XTen on Genomic Medicine
• 6T BASEPAIRS / DAY: at a cost of $20K
• 7K HUMANS / YEAR: at medical grade of 100x oversampling, $1000 / human
• $4B ALL BABIES’ DNA: operating cost of 500 XTen, capital cost of $5B
…and remember, throughput doubles every 5 months
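The figures on this slide follow from the numbers quoted earlier (6T basepairs / day, 300G basepairs per medical grade genome, $1000 per genome, $10M per XTen, 4M US births per year). A back-of-the-envelope sketch, with constant names chosen here for illustration:

```python
import math

BASEPAIRS_PER_DAY = 6e12        # XTen throughput, from the slides
BASEPAIRS_PER_GENOME = 300e9    # medical grade (100x oversampled) genome
COST_PER_GENOME_USD = 1_000
CAPITAL_COST_USD = 10e6         # per XTen
US_BIRTHS_PER_YEAR = 4e6        # 2012

genomes_per_day = BASEPAIRS_PER_DAY / BASEPAIRS_PER_GENOME     # 20 genomes/day
genomes_per_year = genomes_per_day * 365                       # 7,300 ("7K humans / year")
daily_ops_cost = genomes_per_day * COST_PER_GENOME_USD         # $20K/day

# Sequencing every US newborn:
machines_needed = math.ceil(US_BIRTHS_PER_YEAR / genomes_per_year)  # 548, rounded to 500 on the slide
annual_ops_cost = US_BIRTHS_PER_YEAR * COST_PER_GENOME_USD          # $4B
capital_cost = machines_needed * CAPITAL_COST_USD                   # about $5.5B, "$5B" on the slide

print(genomes_per_day, genomes_per_year, machines_needed, annual_ops_cost, capital_cost)
```

The slide rounds 548 machines down to 500, which is where the $5B capital cost comes from.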
Ion Torrent Technology
Nanopore Tech
* Roche
* Oxford Nanopore
Mobile Devices?
IoT DNA Sensors?
Social Impact
“The genetic engineers gave him that birthmark 
as part of a sponsorship deal.” 
© 2014 ww wMap.Rt Thechenonlogeiesss.com 30
Problem Prevention
Genetically deterministic diseases
• Muscular dystrophy
• Cystic fibrosis
• Albinism
• Phenylketonuria
• Hemophilia
• These are all completely preventable with pre-conception and pre-natal screening
http://www.nature.com/scitable/topicpage/rare-genetic-disorders-learning-about-genetic-disease-979
Problem Prevention?
• FACT: US paternity fraud rate is 1 in 20
http://pandawhale.com/post/13851/my-report-card-came-in-my-paternity-test-came-in
http://en.wikipedia.org/wiki/Paternity_fraud
Problem? Prevention – More troubling diseases
• Huntington’s disease – autosomal dominant
• Inherited cancers – 10% of occurrences (!!!)
• Total annual cancer spending is $50B over 1M people
http://www.nature.com/scitable/topicpage/rare-genetic-disorders-learning-about-genetic-disease-979
http://meps.ahrq.gov/mepsweb/data_files/publications/st80/stat80.pdf
Genetic Architecture of Cancer Risk
Inherited Cancer: large proportion of incidence and cost
http://www.cancer.gov/cancertopics/pdq/genetics/overview/healthprofessional/page1/AllPages
Problem? Prevention – More troubling diseases
• Huntington’s disease – autosomal dominant
• Inherited cancers – 10% of occurrences (!!!)
• Total annual cancer spending is $50B over 1M people
Metrics => Diagnostics => Prognostics (domestication)
Metrics => Diagnostics => Therapeutics (traditional medicine)
Singapore: Global Leader of Social Engineering
“Our government wants smart ladies to meet smart guys to get smart children.”
–Director, Club2040 (Singapore matchmaking agency)
Domestication: the process whereby a population of living organisms is changed at the genetic level, through generations of selective breeding, to accentuate traits that ultimately benefit the interests of humans.
http://www.nytimes.com/2008/04/29/world/asia/29iht-sing.1.12428974.html
Science Fiction-like Unprecedented Control
Dystopia / Eutopia Scenarios are Possible
Many DNA-Based Apps Coming…
[Chart: 2014 vs. 2020, y-axis 0 to 25; series: Clinical, Non-Clinical]
• 2014: US$ 2B, mostly research, mostly chemical costs
• 2020: US$ 20B, mostly clinical, mostly analytics costs
Macquarie Capital, 2014. Genomics 2.0: It’s just the beginning
DNA Diagnostics
Traditional Therapeutics
Trial & Error
http://www.taylorwessing.com/synapse/regulatory_personalised_medicines.html
Personalized Therapeutics
Unsuitable Therapy / Suitable Therapy
Trial & Error
http://www.taylorwessing.com/synapse/regulatory_personalised_medicines.html
http://blog.covance.com/tag/companion-diagnostics/
Let’s look at some real data…
http://blog.covance.com/tag/companion-diagnostics/
Folate (aka vitamin B9, aka pteroylglutamate: a molecule that contains a glutamate unit)
Yes, the glutamate of Monosodium Glutamate (MSG) fame…
Essential for fetal development and brain function
Also used to detoxify
Folate (vitamin B9) rich foods
MTHFR is Frequently Mutated (50%) => Dysfunction
http://www.easytolovebut.com/?p=2782
Clinical Sequencing Business Process Workflow
[Diagram labels: Patient, Physician, Clinic, blood/saliva, Clinical Lab, extract, Analytics]
One Bad MTHFR
MTHFR C677T
Methylfolate helps make neurotransmitters in your brain. When methylfolate levels are low, so are your neurotransmitters. Low production of neurotransmitters may cause conditions of addictive behavior, depression, anxiety, ADHD, mania, irritability, insomnia, learning disorders and others.
Everyone should get tested. Why? Because 1 in 2 people are affected and if one knows they have a MTHFR polymorphism, they know they have to be very proactive in taking care of themselves.
http://thyroid.about.com/od/MTHFR-Gene-Mutations-and-Polymorphisms/fl/The-Link-Between-MTHFR-Gene-Mutations-and-Disease-Including-Thyroid-Health.htm
Pulp Fiction
Pulp Fiction & BMFwallets.com
Star Wars: Episode VII
Clinical Genomics, Information Systems Perspective
[Diagram labels: Stakeholder, Analyst, Patient, Physician; Compressed Structured Base4 Data; extract; Base4=>Base2 Converter; Unstructured Base2 Data; Reporting and Viz]
1º analytics: Base4=>Base2 destroys structure
Base4 => Base2 conversion, DNA fragmentation
Genome (base4) => Sequencer => DNA Fragments (base2)
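One way to picture the Base4 => Base2 step is sketched below. This is an illustration under stated assumptions, not the sequencer's actual file format: the 2-bit encoding table, `to_bits`, `from_bits`, and `fragment` are hypothetical names introduced here. The point is that each base fits in two bits, and that fragmentation throws away where each read came from.

```python
import random

# Hypothetical 2-bit code: one base-4 symbol becomes two base-2 digits.
ENCODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
DECODE = {bits: base for base, bits in ENCODE.items()}


def to_bits(seq):
    """Pack a DNA string into an integer, 2 bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | ENCODE[base]
    return value


def from_bits(value, length):
    """Unpack `length` bases; the length is needed because leading A's are zero bits."""
    bases = []
    for _ in range(length):
        bases.append(DECODE[value & 0b11])
        value >>= 2
    return "".join(reversed(bases))


def fragment(genome, read_len, n_reads, seed=0):
    """Sample fixed-length reads at random offsets and discard the offsets.

    Losing the offsets is the sense in which 1º analytics "destroys structure".
    """
    rng = random.Random(seed)
    starts = [rng.randrange(len(genome) - read_len + 1) for _ in range(n_reads)]
    return [genome[s:s + read_len] for s in starts]


genome = "GATTACAGATTACA"
reads = fragment(genome, read_len=4, n_reads=5)
print(bin(to_bits("ACGT")), reads)
```

Recovering the offsets from the reads alone is the job of the 2º analytics (fragment re-assembly) discussed later in the deck.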
Clinical Genomics, Information Systems Perspective
[Diagram labels: Stakeholder, Analyst, Patient, Physician; Compressed Structured Base4 Data; extract; Base4=>Base2 Converter; Unstructured Base2 Data; Reporting and Viz]
1º analytics: Base4=>Base2 destroys structure
2º analytics: recovers structure
Clinical Genomics, Information Systems Perspective
[Diagram labels: Stakeholder, Analyst, Patient, Physician; ETL; Reporting and Viz; Data Store]
Clinical Genomics, Data Science Process
[Diagram labels: Stakeholder, Analyst, Data Scientist, Patient, Physician; ETL; Reporting and Viz; Data Store; Analytics; Improve Tests]
Clinical Genomics, Political Process
[Diagram labels: Stakeholder, Analyst, Data Scientist, Patient, Physician; ETL; Reporting and Viz; Data Store; Analytics; Improve Tests; Improve Policy; Public Health System]
Clinical Genomics, Data Science Process
[Diagram labels: Stakeholder, Analyst, Patient, Physician; ETL; Reporting and Viz; Data Store; Analytics]
Stages: Experiment Design, DNA Sequencing, Secondary Analytics, Downstream Analytics
DNA Sequencing Value Chain
[Chart: % effort (0 to 100) at Pre-NGS (~2000), Now, and Future (~2020); series: Experiment Design, DNA Sequencing, Secondary Analytics, Downstream Analytics]
Sboner, et al, 2011. The real cost of sequencing: higher than you think!
Sequence is Becoming Free (Even Moore’s Law)
Commoditization Pattern
Huge influx of inexpensive data
Creates new medical and biotech use-cases
[Chart: % effort (0 to 100), Pre-NGS (~2000), Now, Future]
Experiment Design and Downstream Analytics
Specialization will grow to 100% effort
This is the desirable scenario
Biologists ought to be doing biology
[Chart: % effort (0 to 100), Pre-NGS (~2000), Now, Future; highlighted: Downstream Analytics]
BigData Bottleneck
Time currently being spent on BigData problems
Not ideal
Physicians & Biologists need help from CS & SW Engineers
[Chart: % effort (0 to 100), Pre-NGS (~2000), Now, Future; callout: Optimize This]
2º Analytics (fragment re-assembly), Quick Overview
Some of these differences occur in the personal DNA…
http://www.bioinfor.com/zoom/support/tutorials.html
Some Errors come from the Sequencer
• This is why a “medical grade” genome is 100x oversampled
http://www.medicinalgenomics.com/dna-assembly/
Goal: Find the Real Differences
Some of these differences occur in the personal DNA…
http://www.bioinfor.com/zoom/support/tutorials.html
1. What is the (Probable) Color of Each Column?
2. Is the Probable Color Surprising?
[Figure label: Expected]
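The two questions above can be sketched as a majority vote per column followed by a comparison against the expected sequence. This is a minimal illustration, assuming the reads are already aligned into a matrix (rows are reads, columns are genome positions, `-` means no coverage); the function names and toy data are invented here.

```python
from collections import Counter


def probable_bases(pileup):
    """Step 1: the most frequent base ("color") in each column; N if uncovered."""
    calls = []
    for column in zip(*pileup):
        counts = Counter(base for base in column if base != "-")
        calls.append(counts.most_common(1)[0][0] if counts else "N")
    return "".join(calls)


def surprising(calls, expected):
    """Step 2: positions where the probable base differs from the expected one."""
    return [
        (pos, exp, call)
        for pos, (exp, call) in enumerate(zip(expected, calls))
        if call not in (exp, "N")
    ]


expected = "ACGTACGT"
pileup = [
    "ACGTAC--",
    "-CGTTCGT",
    "--GTTCGT",
    "ACGTTC--",
    "ACCT----",  # the C in column 2 is outvoted, like a sequencer error
]

calls = probable_bases(pileup)
print(calls, surprising(calls, expected))
```

With enough oversampling, isolated sequencer errors are outvoted while a real personal difference (column 4 here) shows up in most reads.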
Row-major strategies: inefficient with CPU or Mem
Strategy 1: foreach column, foreach row
O(rows*cols) + O(1 col) mem
Strategy 2: foreach row, keep running tallies
O(rows) + O(rows*cols) mem
Strategy 3: rotate => foreach (transposed) column
O(rows log rows) + O(cols) + O(1 col) mem
REQUIRES SORT
[Figure labels: Expected, Personal variants]
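Strategy 3 can be sketched in a few lines: emit one (position, base) record per sequenced base, sort the records by position, then stream through one column at a time. This is a single-machine illustration under assumptions, not the talk's implementation: reads are taken as already aligned `(start, sequence)` pairs, and the in-memory `sorted` call stands in for the distributed sort that Hadoop would provide.

```python
from collections import Counter
from itertools import groupby
from operator import itemgetter


def rotate(reads):
    """Transpose: turn row-oriented reads into (position, base) records."""
    for start, seq in reads:
        for offset, base in enumerate(seq):
            yield start + offset, base


def personal_variants(reads, expected):
    records = sorted(rotate(reads), key=itemgetter(0))        # the required sort
    variants = []
    for pos, group in groupby(records, key=itemgetter(0)):    # one column in memory at a time
        call = Counter(base for _, base in group).most_common(1)[0][0]
        if call != expected[pos]:
            variants.append((pos, expected[pos], call))
    return variants


expected = "ACGTACGT"
reads = [(0, "ACGTAC"), (1, "CGTTCGT"), (2, "GTTCGT"), (0, "ACGTTC"), (0, "ACCT")]
print(personal_variants(reads, expected))
```

Because each column is processed independently after the sort, the columns can be split across shards, which is what makes this strategy attractive for very large matrixes.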
Fortunately, Hadoop is Really Good at Sorting 
Comparison of Strategies
Strategy 1: O(rows*cols) + O(1 col) mem. Compute Intense, Memory Efficient
Strategy 2: O(rows) + O(rows*cols) mem. Compute Efficient, Memory Intense
Strategy 3: O(rows log rows) + O(cols) + O(1 col) mem. Compute Efficient, Memory Efficient
Strategy 3, sharded: O(rows log rows) ÷ shards + O(cols) ÷ shards + O(1 col) mem
OK for small matrixes
Infinitely attractive for infinitely large matrixes
Distributed matrix math techniques are common in tech, but not common in genomics
Some commercial leaders: Seven Bridges Genomics, Ion Flux
Twitter Algebird: Matrix math on Hadoop
Paradigm Shift in Discovery
All next-gen drugs will require a “companion diagnostic”…
…as part of Rx to determine your personal response segment
[Diagram labels: Patient, Physician; ETL; Reporting and Viz; Data Store]
Personalized Therapeutics
Unsuitable Therapy / Suitable Therapy
http://www.taylorwessing.com/synapse/regulatory_personalised_medicines.html
How to Determine Response Segment? EHR Mining
Building the graph of medicine from millions of clinical narratives. Finlayson, 2014
Metrics => Diagnostics => Prognostics
Metrics => Diagnostics => Therapeutics
Personal Genome in EHR => Better Therapeutics
Personal Genome in EHR => Prognostic Medicine
> Predict personal outcome
> Preventative action
How to Improve?
http://www.cancer.gov/cancertopics/pdq/genetics/overview/healthprofessional/page1/AllPages
Strategy 1: Improve Diagnostics & Therapeutics
Improve Health with Personalized Therapy
http://www.cancer.gov/cancertopics/pdq/genetics/overview/healthprofessional/page1/AllPages
Strategy 2: Improve Prognostics => Domestication
Reshape / Remove
http://www.fao.org/docrep/field/009/v8720e/v8720e04.htm
Strategy 2: Improve Prognostics => Domestication
Beware: Law of Unintended Consequences
From: “The Unanticipated Consequences of Purposive Social Action”, Merton, 1936
http://www.fao.org/docrep/field/009/v8720e/v8720e04.htm
Star Wars V: The Empire Strikes Back
“When Racism was Science”, 10/13 in
• Current Exhibit at NYU (thru March 2015)
• Reproduction of the “Eugenics Record Office”
• @ Cold Spring Harbor Laboratory (30min from NYC)
• Supported by many social progressives (Carnegie Institute, Rockefeller family, Harvard)
• “There were many prominent New Yorkers involved in eugenics […] It was initially about how to become more efficient as a modern society.”
• Today, viewed as a scientific disgrace by Americans. What about later?
http://www.nytimes.com/2014/10/14/science/haunted-files-the-eugenics-record-office-recreates-a-dark-time-in-a-laboratorys-past.html
How to Architect a Solution?
http://www.cancer.gov/cancertopics/pdq/genetics/overview/healthprofessional/page1/AllPages
Personal Genome
Predict personal outcome
Let’s look at an existing system that’s similar…
Largest Biometric Database in the World
1.2B PEOPLE
Why Create Aadhaar?
• India: 1.2 billion residents
– 640,000 villages, ~60% live under $2/day
– ~75% literacy, <3% pay income tax, <20% have bank accounts
– ~800 million mobile, ~200-300 million migrant workers
• Govt. spends about $25-40 billion on direct subsidies
– Residents have no standard identity document
– Most programs plagued with ghost and multiple identities causing leakage of 30-40%
Standardize identity => Stop leakage
Aadhaar Biometric Capture & Index
Raw Digital Fingerprint
Aadhaar Biometric ID Creation
F(x): unique features
G(x): uncommon features
H(x): other features
• 600 to 800 million people loaded in 4 years
• In production
– 1 million registrations/day
– 200+ trillion lookups/day
• All built on MapR-DB (HBase)
How Does this Relate to Genomics?
Same data shape and size
• Aadhaar: 1B humans, 5MB minutiae
• Genome: 6B humans, ~3M variants
Same data set operations
• F^-1(x): common features
• F(x): unique features
• G(x): uncommon features
• H(x): other features
• Genome: variant × phenotype
• Common variant => causal gene: F^-1(x)!
Same set operations
Same data access pattern
Same algorithm / memory complexity
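A rough sketch of what "same set operations" could mean in practice: count how many people carry each feature, then split features into unique, uncommon, and common. The mapping of these classes onto F(x), G(x), and F^-1(x) is an interpretation made here, not something the slides spell out, and the function name, threshold, and data are hypothetical. The same code works whether a "feature" is a fingerprint minutia or a genomic variant.

```python
from collections import Counter


def classify_features(people, uncommon_max_carriers=2):
    """Split features by carrier count: unique (1), uncommon (2..max), common (more)."""
    carriers = Counter(f for features in people.values() for f in features)
    unique = {f for f, n in carriers.items() if n == 1}
    uncommon = {f for f, n in carriers.items() if 1 < n <= uncommon_max_carriers}
    common = set(carriers) - unique - uncommon
    return unique, uncommon, common


# Hypothetical variant sets per person.
people = {
    "p1": {"v1", "v2", "v9"},
    "p2": {"v1", "v2"},
    "p3": {"v1", "v3"},
    "p4": {"v1", "v3", "v7"},
}

unique, uncommon, common = classify_features(people)
print(sorted(unique), sorted(uncommon), sorted(common))
```

For identification the unique features matter most; for finding a causal gene the interest shifts to features shared by people with the same phenotype.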
How Does this Relate to Genomics?
Join and Regress
Phenotype ~ Genotype
Generate new Hypotheses
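"Join and Regress" can be sketched as an inner join of genotype and phenotype records on a patient id, followed by a least-squares fit of Phenotype ~ Genotype. Everything below is illustrative: the patient ids, the single-variant dosage coding (0, 1, or 2 copies), and the phenotype values are made up, and a real analysis would handle many variants and covariates.

```python
from statistics import fmean


def join(genotypes, phenotypes):
    """Inner join on patient id, giving (dosage, phenotype) pairs."""
    shared = sorted(genotypes.keys() & phenotypes.keys())
    return [(genotypes[pid], phenotypes[pid]) for pid in shared]


def regress(pairs):
    """Ordinary least squares fit of phenotype = intercept + slope * dosage."""
    xs, ys = zip(*pairs)
    mx, my = fmean(xs), fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    slope = sxy / sxx
    return my - slope * mx, slope


genotypes = {"p1": 0, "p2": 1, "p3": 2, "p4": 1, "p5": 0}        # copies of a variant allele
phenotypes = {"p1": 5.0, "p2": 7.0, "p3": 9.0, "p4": 7.0, "p6": 4.0}

pairs = join(genotypes, phenotypes)
intercept, slope = regress(pairs)
print(pairs, intercept, slope)
```

A slope clearly different from zero is a candidate association, which is the "generate new hypotheses" step rather than a proof of causation.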
There’s Something Bigger Happening…
Medical Renaissance is Part of a Paradigm Shift
• The Structure of Scientific Revolutions, Kuhn, 1962
http://edtosavetheworld.com/2014/05/01/change-theory-do-we-need-a-new-paradigm/
1st Paradigm: Observational (pre)Science
[Diagram labels: 1st paradigm: observational science; measurement, description, observation, theory]
2nd Paradigm: Theoretical (pre)Science
[Diagram adds: 2nd paradigm: theoretical science; explanation, mathematics, computing, simulation]
3rd Paradigm: Experimental Science
[Diagram adds: 3rd paradigm: experimental science; directed control channel, observation]
Medical Renaissance is Part of a Paradigm Shift
• The Structure of Scientific Revolutions, Kuhn, 1962
4th Paradigm: eScience, enabled by BigData
[Diagram labels: 4th paradigm: eScience; explore, simulate, explain; accumulated knowledge; computing]
big data engine => “escape velocity”
http://en.wikipedia.org/wiki/The_Structure_of_Scientific_Revolutions
We are here
Thanks! 
Questions? 
@allenday, @mapr 
linkedin.com/in/allenday 
aday@mapr.com 
“Good news – I hear the paradigm is shifting” 
~ New Yorker, Charles Barsotti

Renaissance in Medicine - Strata - NoSQL and Genomics

  • 1. © 2014 MapR Technologies 1 “…all men are created equal…”
  • 2. © 2014 Star Wars V: The Empire Strikes Back MapR Technologies 2
  • 3. © 2014 Dahm Triplets, Playboy 1998 MapR Technologies 3
  • 4. All Men Are Created Equal • Data Scientist: “depends on the distance metric” © 2014 MapR Technologies 4 DLaw(xa, xb) DGenetic(xa, xb)
  • 5. © 2014 © 201 M4 aMpaRp RTe Tcehcnhonloogloiegsies 5 DGenetic(x, x) = ??? Copenhagen, Denmark 1911
  • 6. © 2014 MapR Technologies 6 An Unusual Observation • Wilhelm Johannsen. 1911, Copenhagen – Coins the word “gene” • Discovery: sizes are not identical for genetically identical peas. The Genotype Concept of Heredity. Johannsen, 1911
  • 7. © 2014 MapR Technologies 7 An Unusual Observation • Wilhelm Johannsen. 1911, Copenhagen – Coins the word “gene” • Discovery: sizes are not identical for genetically identical peas. DGenetic(x, x) != 0 The Genotype Concept of Heredity. Johannsen, 1911 WTF?
  • 8. © 2014 MapR Technologies 8 An Unusual Observation • Wilhelm Johannsen. 1911, Copenhagen – Coins the word “gene” • Discovery: sizes are not identical for genetically identical peas. • Introduces new concept: – hidden and causal genotype distinct from – observed phenotype The Genotype Concept of Heredity. Johannsen, 1911
  • 9. © 2014 MapR Technologies 9 An Unusual Observation • Wilhelm Johannsen. 1911, Copenhagen – Coins the word “gene” • Discovery: sizes are not identical for genetically identical peas. Further, size follows a Gaussian process, where P ~ G + E Var(P) ~ Var(G) + Var(E) + 2Cov(G, E) • This is the basis of quantitative genetics https://www.youtube.com/watch?v=AjI_LcQOOs4
  • 10. © 2014 Star Wars: Episode VII MapR Technologies 10
  • 11. © 2014 Star Wars V: The Empire Strikes Back MapR Technologies 11
  • 12. © 2014 MapR Techno©lo 2g0ie1s4 MapR Technologies 12 Renaissance In Medicine: Next Generation BigData Workloads
  • 13. © 2014 The Big Lebowski MapR Technologies 13
  • 14. 1st Renaissance In Medicine: 1400-1700, Europe Enabling Factor • Movable Type © 2014 MapR Technologies 14 1450, Gutenberg • Weak Church • Compound Microscope 1624, Galileo • Math-driven Hypotheses 1687, Newton Effect • Rapid diffusion of ideas • New data sources Human dissection • Dense, precise data Diagrams • Paradigm shift in reasoning
  • 15. 1st Renaissance In Medicine: 1400-1700, Europe Enabling Factor • Movable Type © 2014 MapR Technologies 15 1450, Gutenberg • Weak Church • Compound Microscope 1624, Galileo • Math-driven Hypotheses 1687, Newton’s H0 Effect • Rapid diffusion of ideas • New data sources Human dissection • Dense, precise metrics Diagrams • Paradigm shift in discovery
  • 16. © 2014 MapR Technologies 16 2nd Renaissance In Medicine: 1900- Enabling Factor • Telecom Networks 1969, ARPANET. 2008 • Globalization • Next-Gen DNA Sequencer 1992, Lynx Therapeutics • Data-driven Hypotheses 4th scientific paradigm Effect • Rapid diffusion of ideas • New data sources GMOs, stem cells • Dense, precise metrics Human genomes • Paradigm shift in discovery
  • 17. DNA Sequencer – Dense, Precise Metrics © 2014 © 201 M4 aMpaRp RTe Tcehcnhonloogloiegsies 17
  • 18. Current Generation “Next-Generation” Sequencer • Jan 2014 – launch of Illumina XTen @ $10M (cap cost) • Produces 6T basepairs / day @ $20K (ops cost) © 2014 MapR Technologies 18 Here’s how it works…
  • 19. © 2014 MapR Technologies 19
  • 20. © 2014 MapR Technologies 20
  • 21. © 2014 MapR Technologies 21
  • 22. Current Generation “Next-Generation” Sequencer • Jan 2014 – launch of Illumina XTen @ $10M (cap cost) • Produces 6T basepairs / day @ $20K (ops cost) A few other facts… ★ 300G basepairs / human @ $1000 (medical grade genome) ★ 4M births / year (2012, United States) © 2014 MapR Technologies 22
  • 23. ILMN HiSeq XTen (Jan 2014) $1000 Genome “Even Moore’s Law” begins in 2004 © 2014 MapR Technologies 23 Even Moore’s Law Storage: MB/$ Stein. 2010. The case for cloud computing in genome informatics DNA: bp/$
  • 24. © 2014 MapR Technologies, confidential Impact of XTen on Genomic Medicine 6T 7K HUMANS / YEAR $4B ALL BABIES’ DNA At a cost of $20K At medical grade of 100x oversampling, $1000 / human Operating cost of 500 XTen, Capital cost of $5B BASEPAIRS / DAY …and remember, throughput doubles every 5 months
  • 25. Ion Torrent Technology © 2014 MapR Technologies, confidential
  • 26. © 2014 MapR Technologies, confidential
  • 27. Nanopore Tech * Roche * Oxford Nanopore © 2014 MapR Technologies, confidential
  • 28. Mobile Devices? IoT DNA Sensors? © 2014 MapR Technologies, confidential
  • 29. © 2014 © 201 M4 aMpaRp RTe Tcehcnhonloogloiegsies 29 Social Impact
  • 30. “The genetic engineers gave him that birthmark as part of a sponsorship deal.” © 2014 ww wMap.Rt Thechenonlogeiesss.com 30
  • 31. © 2014 MapR Technologies 31 Problem Prevention Genetically deterministic diseases • Muscular dystrophy • Cystic fibrosis • Albinism • Phenylketonuria • Hemophilia • These are all completely preventable with pre-conception and pre-natal screening http://www.nature.com/scitable/topicpage/rare-genetic-disorders-learning-about-genetic-disease-979
  • 32. © 2014 MapR Technologies 32 Problem Prevention? • FACT: US paternity fraud rate is 1 in 20 http://pandawhale.com/post/13851/my-report-card-came-in-my-paternity-test-came-in http://en.wikipedia.org/wiki/Paternity_fraud
  • 33. Problem? Prevention – More troubling diseases • Huntingtons disease – autosomal dominant • Inherited cancers – 10% of occurrences (!!!) • Total annual cancer spending is $50B over 1M © 2014 MapR Technologies 33 people http://www.nature.com/scitable/topicpage/rare-genetic-disorders-learning-about-genetic-disease-979 http://meps.ahrq.gov/mepsweb/data_files/publications/st80/stat80.pdf
  • 34. © 2014 MapR Technologies 34 Genetic Architecture of Cancer Risk Inherited Cancer: Large proportion of incidence and cost http://www.cancer.gov/cancertopics/pdq/genetics/overview/healthprofessional/page1/AllPages
  • 35. Problem? Prevention – More troubling diseases • Huntingtons disease – autosomal dominant • Inherited cancers – 10% of occurrences (!!!) • Total annual cancer spending is $50B over 1M © 2014 MapR Technologies 35 people Metrics => Diagnostics => Prognostics (domestication)
  • 36. Problem? Prevention – More troubling diseases • Huntingtons disease – autosomal dominant • Inherited cancers – 10% of occurrences (!!!) • Total annual cancer spending is $50B over 1M Metrics => Diagnostics => Prognostics (domestication) Metrics => Diagnostics => Therapeutics (traditional medicine) © 2014 MapR Technologies 36 people
  • 37. Singapore: Global Leader of Social Engineering “Our government wants smart ladies to meet smart guys to get smart children.” –Director, Club2040 (Singapore matchmaking agency) Domestication: the process whereby a population of living organisms is changed at the genetic level, through generations of selective breeding, to accentuate traits that ultimately benefit the interests of humans. http://www.nytimes.com/2008/04/29/world/asia/29iht-sing.1.12428974.html © 2014 MapR Technologies 37
  • 38. © 2014 MapR Technologies 38 Science Fiction-like Unprecedented Control Dystopia / Eutopia Scenarios are Possible
  • 39. © 2014 MapR Technologies 39 Many DNA-Based Apps Coming… 25 20 15 10 5 0 2014 2020 • 2014: US$ 2B, mostly research, mostly chemical costs • 2020: US$ 20B, mostly clinical, mostly analytics costs Macquarie Capital, 2014. Genomics 2.0: It’s just the beginning Cinical Non-Clinical
  • 40. © 2014 © 201 M4 aMpaRp RTe Tcehcnhonloogloiegsies 40 DNA Diagnostics
  • 41. © 2014 MapR Technologies 41 Traditional Therapeutics Trial & Error http://www.taylorwessing.com/synapse/regulatory_personalised_medicines.html
  • 42. Unsuitable Therapy Suitable Therapy © 2014 MapR Technologies 42 Personalized Therapeutics Trial & Error http://www.taylorwessing.com/synapse/regulatory_personalised_medicines.html
  • 44. Let’s look at some real data… © 2014 MapR Technologies 44 http://blog.covance.com/tag/companion-diagnostics/
  • 45. © 2014 MapR Technologies 45 Folate (aka vitamin B9, aka glutamate) Yes, of Monosodium Glutamate (MSG) fame… Essential for fetal development and brain Function Also used to detoxify
  • 46. Folate (aka vitamin B9, aka glutamate) rich foods © 2014 MapR Technologies 46 http://www.easytolovebut.com/?p=2782
  • 47. Folate (aka vitamin B9, aka glutamate) rich foods © 2014 MapR Technologies 47 http://www.easytolovebut.com/?p=2782
  • 48. Folate (aka vitamin B9, aka glutamate) rich foods © 2014 MapR Technologies 48 MTHFR is Frequently Mutated (50%) => Dysfunction http://www.easytolovebut.com/?p=2782
  • 49. © 2014 MapR Technologies 49
  • 50. Clinical Sequencing Business Process Workflow © 2014 MapR Technologies 50 Patient Physician Clinic blood/saliva Clinical Lab Analytics extract
  • 51. © 2014 MapR Technologies 51 One Bad MTHFR MTHFR C677T Methylfolate helps make neurotransmitters in your brain. When methylfolate levels are low, so are your neurotransmitters. Low production of neurotransmitters may cause conditions of addictive behavior, depression, anxiety, ADHD, mania, irritability, insomnia, learning disorders and others. Everyone should get tested. Why? Because 1 in 2 people are affected and if one knows they have a MTHFR polymorphism, they know they have to be very proactive in taking care of themselves. http://thyroid.about.com/od/MTHFR-Gene-Mutations-and-Polymorphisms/fl/The-Link-Between- MTHFR-Gene-Mutations-and-Disease-Including-Thyroid-Health.htm Pulp Fiction Pulp Fiction & BMFwallets.com
  • 52. © 2014 Star Wars: Episode VII MapR Technologies 52
  • 53. © 2014 MapR Technologies 53
  • 54. © 2014 MapR Technologies 54
  • 55. Clinical Sequencing Business Process Workflow © 2014 MapR Technologies 55 Patient Physician Clinic blood/saliva Clinical Lab Analytics extract
  • 56. Clinical Genomics, Information Systems Perspective © 2014 MapR Technologies 56 Stakeholder Analyst Patient Physician Compressed Structured Base4 Data Unstructured Base2 Data extract Base4=>Base2 Converter Reporting and Viz
  • 57. Clinical Genomics, Information Systems Perspective © 2014 MapR Technologies 57 Stakeholder Analyst Patient Physician Compressed Structured Base4 Data Unstructured Base2 Data extract Base4=>Base2 Converter Reporting and Viz 1º analytics Base4=>Base2 Destroys Structure
  • 58. Base4 => Base2 conversion, DNA fragmentation © 2014 MapR Technologies 58 Genome (base4) Sequencer DNA Fragments (base2)
  • 59. Clinical Genomics, Information Systems Perspective 2º analytics Recovers Structure © 2014 MapR Technologies 59 Stakeholder Analyst Patient Physician Compressed Structured Base4 Data Unstructured Base2 Data extract Base4=>Base2 Converter Reporting and Viz 1º analytics Base4=>Base2 Destroys Structure
  • 60. Clinical Genomics, Information Systems Perspective © 2014 MapR Technologies 60 Stakeholder Analyst Patient Physician ETL Reporting and Viz Data Store
  • 61. Data Scientist © 2014 MapR Technologies 61 Clinical Genomics, Data Science Process Stakeholder Analyst Patient Physician ETL Reporting and Viz Data Store Analytics x Improve Tests
  • 62. Improve Tests Data Scientist © 2014 MapR Technologies 62 Clinical Genomics, Political Process Stakeholder Analyst Patient Physician ETL Reporting and Viz Data Store Analytics x Improve Policy Public Health System
  • 63. © 2014 MapR Technologies 63 Many DNA-Based Apps Coming… [Chart: vertical axis 0 to 25; years 2014 and 2020; series Clinical and Non-Clinical] • 2014: US$ 2B, mostly research, mostly chemical costs • 2020: US$ 20B, mostly clinical, mostly analytics costs Macquarie Capital, 2014. Genomics 2.0: It’s just the beginning
  • 64. © 2014 MapR Technologies 64 Clinical Genomics, Data Science Process Stakeholder Analyst Patient Physician ETL Reporting and Viz Data Store Analytics Secondary Analytics Downstream Analytics Experiment Design DNA Sequencing
  • 65. © 2014 MapR Technologies 65 DNA Sequencing Value Chain 100 % Effort 0 Pre-NGS ~2000 Future ~2020 Now Experiment Design DNA Sequencing Secondary Analytics Downstream Analytics Sboner, et al, 2011. The real cost of sequencing: higher than you think!
  • 66. Sequence is Becoming Free (Even Moore’s Law) © 2014 MapR Technologies 66 Commoditization Pattern Huge influx of inexpensive data Creates new medical and biotech use-cases 100 % Effort 0 Pre-NGS ~2000 Future ~ Now
  • 67. Experiment Design and Downstream Analytics © 2014 MapR Technologies 67 Specialization will grow to 100% effort This is the desirable scenario Biologists ought to be doing biology 100 % Effort 0 Pre-NGS ~2000 Future ~ Now Downstream Analytics
  • 68. © 2014 MapR Technologies 68 BigData Bottleneck Time currently being spent on BigData problems Not ideal 100 % Effort 0 Pre-NGS ~2000 Future ~ Now Physicians & Biologists need help from CS & SW Engineers
  • 69. Optimize This © 2014 MapR Technologies 69 BigData Bottleneck Time currently being spent on BigData problems Not ideal 100 % Effort 0 Pre-NGS ~2000 Future ~ Now Physicians & Biologists need help from CS & SW Engineers
  • 70. 2º Analytics (fragment re-assembly), Quick Overview Some of these differences occur in the personal DNA… © 2014 MapR Technologies 70 http://www.bioinfor.com/zoom/support/tutorials.html
  • 71. © 2014 MapR Technologies 71 Some Errors come from the Sequencer • This is why a “medical grade” genome is 100x oversampled http://www.medicinalgenomics.com/dna-assembly/
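The oversampling argument on slide 71 can be illustrated with a small calculation. This is a minimal sketch, not the speaker's method: it assumes reads err independently at a fixed per-read rate (the 1% used below is a hypothetical figure) and, pessimistically, that every wrong read agrees on the same wrong base. The name `majority_error` is made up for illustration.

```python
from math import comb


def majority_error(depth: int, per_read_error: float) -> float:
    """Probability that a strict majority of `depth` reads is wrong at one position.

    Assumes independent reads with the same error rate (a simplification).
    """
    threshold = depth // 2 + 1  # smallest count that forms a strict majority
    return sum(
        comb(depth, k) * per_read_error**k * (1 - per_read_error) ** (depth - k)
        for k in range(threshold, depth + 1)
    )


if __name__ == "__main__":
    # Print how the chance of a wrong majority call changes with read depth.
    for depth in (1, 3, 11, 101):
        print(depth, majority_error(depth, 0.01))
```

Under these assumptions the chance of a wrong consensus call falls quickly as depth grows, which is the intuition behind heavy oversampling.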
  • 72. © 2014 MapR Technologies 72 Goal: Find the Real Differences Some of these differences occur in the personal DNA… http://www.bioinfor.com/zoom/support/tutorials.html
  • 73. 1. What is the (Probable) Color of Each Column? © 2014 MapR Technologies 73
  • 74. © 2014 MapR Technologies 74 2. Is the Probable Color Surprising? Expected
  • 75. Row-major strategies: inefficient with CPU or Mem Expected © 2014 MapR Technologies 75
  • 76. © 2014 MapR Technologies 76 Strategy 1: foreach column, foreach row O(rows*cols) + Expected O(1 col) mem
  • 77. Strategy 2: foreach row, keep running tallies O(rows) + Expected O(rows*cols) mem © 2014 MapR Technologies 77
  • 78. Strategy 3: rotate => foreach (transposed) column © 2014 MapR Technologies 78 Expected O(rows log rows) + O(cols) + O(1 col) mem REQUIRES SORT Personal variants
  • 79. Fortunately, Hadoop is Really Good at Sorting © 2014 MapR Technologies 79
  • 80. Comparison of Strategies © 2014 MapR Technologies 80 Strategy 1: O(rows*cols) + O(1 col) mem (Compute Intense, Memory Efficient). Strategy 2: O(rows) + O(rows*cols) mem (Compute Efficient, Memory Intense). Strategy 3: O(rows log rows) + O(cols) + O(1 col) mem (Compute Efficient, Memory Efficient).
  • 81. Comparison of Strategies © 2014 MapR Technologies 81 Strategy 1: O(rows*cols) + O(1 col) mem. Strategy 2: O(rows) + O(rows*cols) mem. Strategy 3: O(rows log rows) ÷ shards + O(cols) ÷ shards + O(1 col) mem. OK for small matrices. Infinitely attractive for infinitely large matrices.
  • 82. Comparison of Strategies © 2014 MapR Technologies 82 Strategy 1: O(rows*cols) + O(1 col) mem. Strategy 2: O(rows) + O(rows*cols) mem. Strategy 3: O(rows log rows) ÷ shards + O(cols) ÷ shards + O(1 col) mem. Distributed matrix math techniques are common in tech, but not common in genomics. Some commercial leaders: Seven Bridges Genomics, Ion Flux. Twitter Algebird: Matrix math on Hadoop
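The three strategies from slides 76 to 78 can be sketched on a toy read pileup. This is an illustrative sketch only: the `PILEUP` and `EXPECTED` data, the function names, and the use of a simple majority vote for the "probable color" are all assumptions, and an in-memory sort stands in for the distributed sort that Hadoop would perform.

```python
from collections import Counter
from itertools import groupby
from operator import itemgetter

# Hypothetical toy data: each row is one aligned read, each column one position.
PILEUP = ["ACTT", "ACTT", "ACGT", "AGTT"]
EXPECTED = "ACGT"  # hypothetical reference sequence


def strategy1(rows):
    """foreach column, foreach row: only one column tally is held at a time."""
    consensus = []
    for col in range(len(rows[0])):
        tally = Counter(row[col] for row in rows)
        consensus.append(tally.most_common(1)[0][0])
    return "".join(consensus)


def strategy2(rows):
    """foreach row, keep running tallies: a single pass, one tally per column."""
    tallies = [Counter() for _ in rows[0]]
    for row in rows:
        for col, base in enumerate(row):
            tallies[col][base] += 1
    return "".join(t.most_common(1)[0][0] for t in tallies)


def strategy3(rows):
    """rotate: emit (column, base) records, sort by column, then stream groups."""
    records = [(col, base) for row in rows for col, base in enumerate(row)]
    records.sort(key=itemgetter(0))  # stand-in for a distributed sort
    consensus = []
    for _, group in groupby(records, key=itemgetter(0)):
        tally = Counter(base for _, base in group)
        consensus.append(tally.most_common(1)[0][0])
    return "".join(consensus)


def surprising(consensus, expected):
    """Positions where the probable base differs from the expected base."""
    return [i for i, (c, e) in enumerate(zip(consensus, expected)) if c != e]


if __name__ == "__main__":
    print(strategy1(PILEUP), strategy2(PILEUP), strategy3(PILEUP))
    print(surprising(strategy3(PILEUP), EXPECTED))
```

All three functions return the same consensus; they differ only in access pattern and in how much state is held at once, which is the trade-off the comparison slides describe.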
  • 83. © 2014 MapR Technologies 83 Paradigm Shift in Discovery
  • 84. All next-gen drugs will require a “companion diagnostic”… …as part of Rx to determine your personal response segment © 2014 MapR Technologies 84 Patient Physician Patient Physician ETL Reporting and Viz Data Store
  • 85. © 2014 MapR Technologies 85 Personalized Therapeutics Unsuitable Therapy Suitable Therapy http://www.taylorwessing.com/synapse/regulatory_personalised_medicines.html
  • 86. How to Determine Response Segment? EHR Mining Building the graph of medicine from millions of clinical narratives. Finlayson, 2014 © 2014 MapR Technologies 86
  • 87. How to Determine Response Segment? EHR Mining Metrics => Diagnostics => Prognostics Metrics => Diagnostics => Therapeutics © 2014 MapR Technologies 87
  • 88. How to Determine Response Segment? EHR Mining Metrics => Diagnostics => Prognostics Metrics => Diagnostics => Therapeutics © 2014 MapR Technologies 88
  • 89. How to Determine Response Segment? EHR Mining © 2014 MapR Technologies 89
  • 90. Personal Genome in EHR => Better Therapeutics © 2014 MapR Technologies 90 Personal Genome
  • 91. Personal Genome in EHR => Prognostic Medicine © 2014 MapR Technologies 91 Personal Genome >Predict personal outcome >Preventative action
  • 92. © 2014 MapR Technologies 92 How to Improve? http://www.cancer.gov/cancertopics/pdq/genetics/overview/healthprofessional/page1/AllPages
  • 93. Strategy 1: Improve Diagnostics & Therapeutics © 2014 MapR Technologies 93 Improve Health with Personalized Therapy http://www.cancer.gov/cancertopics/pdq/genetics/overview/healthprofessional/page1/AllPages
  • 94. Strategy 2: Improve Prognostics => Domesticate © 2014 MapR Technologies 94 Reshape / Remove
  • 95. Strategy 2: Improve Prognostics => Domestication © 2014 MapR Technologies 95 http://www.fao.org/docrep/field/009/v8720e/v8720e04.htm
  • 96. Strategy 2: Improve Prognostics => Domestication Beware: Law of Unintended Consequences From: “The Unanticipated Consequences of Purposive Social Action”, Merton, 1936 © 2014 MapR Technologies 96 http://www.fao.org/docrep/field/009/v8720e/v8720e04.htm
  • 97. © 2014 Star Wars V: The Empire Strikes Back MapR Technologies 97
  • 98. © 2014 MapR Technologies 98 “When Racism was Science”, 10/13 in • Current Exhibit at NYU (thru March 2015) • Reproduction of the “Eugenics Record Office” • @ Cold Spring Harbor Laboratory (30min from NYC) • Supported by many social progressives (Carnegie Institute, Rockefeller family, Harvard) • “There were many prominent New Yorkers involved in eugenics […] It was initially about how to become more efficient as a modern society.” • Today, viewed as a scientific disgrace by Americans. What about later? http://www.nytimes.com/2014/10/14/science/haunted-files-the-eugenics-record-office-recreates-a-dark-time-in-a-laboratorys-past.html
  • 99. © 2014 MapR Technologies 99 How to Architect a Solution? http://www.cancer.gov/cancertopics/pdq/genetics/overview/healthprofessional/page1/AllPages
  • 100. © 2014 MapR Technologies 100 How to Architect a Solution? Personal Genome Predict personal outcome
  • 101. © 2014 MapR Technologies 101 How to Architect a Solution? Personal Genome Predict personal outcome Let’s look at an existing system that’s similar…
  • 102. Largest Biometric Database in the World © 2014 MapR Technologies 102 1.2B PEOPLE PEOPLE
  • 103. © 2014 MapR Technologies 103 Why Create Aadhaar? • India: 1.2 billion residents – 640,000 villages, ~60% live under $2/day – ~75% literacy, <3% pay income tax, <20% have bank accounts – ~800 million mobile, ~200-300 million migrant workers • Govt. spends about $25-40 billion on direct subsidies – Residents have no standard identity document – Most programs plagued with ghost and multiple identities causing leakage of 30-40%
  • 104. © 2014 MapR Technologies 104 Why Create Aadhaar? • India: 1.2 billion residents – 640,000 villages, ~60% live under $2/day – ~75% literacy, <3% pay income tax, <20% have bank accounts – ~800 million mobile, ~200-300 million migrant workers • Govt. spends about $25-40 billion on direct subsidies – Residents have no standard identity document – Most programs plagued with ghost and multiple identities causing leakage of 30-40% Standardize identity => Stop leakage
  • 105. © 2014 MapR Technologies 105 Aadhaar Biometric Capture & Index Raw Digital Fingerprint
  • 106. © 2014 MapR Technologies 106 Aadhaar Biometric ID Creation F(x): unique features G(x): uncommon features H(x): other features • 600 to 800 million people loaded in 4 years • In production – 1 million registrations/day – 200+ trillion lookups/day • All built on MapR-DB (HBase)
  • 107. © 2014 MapR Technologies 107 Aadhaar Biometric ID Creation F(x): unique features G(x): uncommon features H(x): other features • 600 to 800 million people loaded in 4 years • In production – 1 million registrations/day – 200+ trillion lookups/day • All built on MapR-DB (HBase)
  • 108. © 2014 MapR Technologies 108 How Does this Relate to Genomics? F(x): unique features G(x): uncommon features H(x): other features Same data shape and size • Aadhaar: 1B humans, 5MB minutia • Genome: 6B humans, ~3M variants
  • 109. © 2014 MapR Technologies 109 How Does this Relate to Genomics? Same data set operations F-1(x): common features F(x): unique features G(x): uncommon features H(x): other features Same data shape and size • Aadhaar: 1B humans, 5MB minutia • Genome: 6B humans, ~3M variants • Genome: variant × phenotype • Common variant => causal gene F-1(x) !
  • 110. © 2014 MapR Technologies 110 How Does this Relate to Genomics? Same data set operations F-1(x): common features F(x): unique features Same set operations Same data access pattern Same algorithm / memory complexity
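The "same set operations" point on slides 109 and 110 can be made concrete with ordinary set arithmetic. This is a sketch under stated assumptions: the variant identifiers, the case/control labels, and the function names are hypothetical, and reading "common features" as an intersection across affected people and "unique features" as a set difference is one interpretation of the slides.

```python
# Hypothetical variant sets per person; identifiers are invented for illustration.
VARIANTS = {
    "case_1": {"v1", "v2", "v7"},
    "case_2": {"v1", "v3", "v7"},
    "control_1": {"v1", "v4"},
}


def common_features(variants, people):
    """Features shared by every listed person (set intersection)."""
    return set.intersection(*(variants[p] for p in people))


def unique_features(variants, person):
    """Features seen in one person and in nobody else (set difference)."""
    others = set().union(*(v for p, v in variants.items() if p != person))
    return variants[person] - others


def candidate_variants(variants, cases, controls):
    """Variants common to all cases and absent from every control."""
    in_controls = set().union(*(variants[p] for p in controls))
    return common_features(variants, cases) - in_controls


if __name__ == "__main__":
    print(unique_features(VARIANTS, "case_1"))
    print(candidate_variants(VARIANTS, ["case_1", "case_2"], ["control_1"]))
```

The same two primitives serve both use cases: a unique-feature lookup identifies an individual, while a common-feature lookup across affected people narrows down candidate variants.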
  • 111. © 2014 MapR Technologies 111 How Does this Relate to Genomics? Join and Regress Phenotype ~ Genotype Generate new Hypotheses
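The "Join and Regress, Phenotype ~ Genotype" step on slide 111 can be sketched as a key join followed by an ordinary least-squares fit. This is a minimal illustration, not the speaker's pipeline: the sample IDs, the allele counts, the phenotype values, and the function names are all hypothetical, and a single-variant linear fit stands in for whatever model a real study would use.

```python
# Hypothetical inputs keyed by sample ID.
GENOTYPE = {"s1": 0, "s2": 1, "s3": 2, "s4": 1}  # alternate-allele count at one variant
PHENOTYPE = {"s1": 1.0, "s2": 2.0, "s3": 3.0, "s5": 9.9}  # some measured trait


def join(genotype, phenotype):
    """Inner join on sample ID; returns (allele_count, trait) pairs."""
    shared = sorted(genotype.keys() & phenotype.keys())
    return [(genotype[s], phenotype[s]) for s in shared]


def regress(pairs):
    """Ordinary least squares for trait = intercept + slope * allele_count."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


if __name__ == "__main__":
    pairs = join(GENOTYPE, PHENOTYPE)
    print(pairs)
    print(regress(pairs))
```

Samples present in only one table (here `s4` and `s5`) drop out of the join, so the fit uses only people with both a genotype and a phenotype record.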
  • 112. There’s Something Bigger Happening… © 2014 MapR Technologies 112
  • 113. Medical Renaissance is Part of a Paradigm Shift © 2014 MapR Technologies 113 • The Structure of Scientific Revolutions, Kuhn, 1962 http://edtosavetheworld.com/2014/05/01/change-theory-do-we-need-a-new-paradigm/
  • 114. © 2014 MapR Technologies 114 1st Paradigm: Observational (pre)Science measurement 1st paradigm: observational science description observation theory
  • 115. simulation © 2014 MapR Technologies 115 2nd Paradigm: Theoretical (pre)Science measurement 1st paradigm: observational science description observation theory explanation 2nd paradigm: theoretical science mathematics computing
  • 116. simulation © 2014 MapR Technologies 116 3rd Paradigm: Experimental Science measurement 1st paradigm: observational science description observation theory explanation 2nd paradigm: theoretical science mathematics computing directed control channel observation 3rd paradigm: experimental science
  • 117. Medical Renaissance is Part of a Paradigm Shift © 2014 MapR Technologies 117 • The Structure of Scientific Revolutions, Kuhn, 1962
  • 118. 4th Paradigm: eScience, enabled by BigData explore simulate explain © 2014 MapR Technologies 118 4th paradigm: eScience accumulated knowledge computing big data engine => “escape velocity”
  • 119. © 2014 MapR Technologies 119 http://en.wikipedia.org/wiki/The_Structure_of_Scientific_Revolutions We are here
  • 120. © 2014 MapR Technologies, confidential Thanks! Questions? @allenday, @mapr linkedin.com/in/allenday aday@mapr.com “Good news – I hear the paradigm is shifting” ~ New Yorker, Charles Barsotti

Editor's Notes

  1. As a biologist would say, it’s not true, with the exception of clones.
  2. We do have human clones today, by the way. They’re called identical twins.
  3. But as the whole is the sum of its parts, biologists disagree because the DNA of individuals is not equal. As usual, there is one exception.
  4. Exasperated Walter – indicates this slide contains opinion content This is a tech talk, not a sermon
  5. Emphasize 4th paradigm
  6. Show some actual chemistry. Here or earlier. “Here is how a sequencer works” around slide 17. Add more emphasis that Xten is a cluster of machines. Maybe include a network diagram of sequencers => storage => compute
  7. Dating websites
  8. cinical
  9. What is destructuring?
  10. cinical
  11. Call out in layman terms rather than bigO. Label summing bird.
  12. Gene therapy???
  13. Put in a picture of some betta fish
  14. Put in a picture of some betta fish
  15. 102
  16. Increase GDP by 2%
  17. BOOM LSH
  18. BOOM LSH