Allen Day, PhD // Science Advocate // @allenday // #genomics #ml #datascience
GOOGLE CONFIDENTIAL
Google Cloud
Run your apps on the same system as Google
Environments
Genotypes
Quantifying Phenotypes
a Googler’s perspective
Generate Marker Fingerprint
Select & Recombine
Sample tissue
Breeding
Genotyping Lab
Extract DNA
Analyze & Model Data
Grow
Marker-Assisted Breeding Rapidly Increases Frequency of Favorable Genes
Cloud ML
TensorFlow
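To make the "rapidly increases frequency" claim concrete, here is a minimal sketch under textbook simplifying assumptions that are not from the slides: one biallelic marker, random mating (Hardy-Weinberg genotype proportions), and selection that keeps every plant carrying at least one copy of the favorable allele. The function names and the starting frequency of 0.1 are hypothetical.

```python
def next_generation_frequency(p: float) -> float:
    """Favorable-allele frequency among selected parents after one round.

    Assumes Hardy-Weinberg proportions and that only carriers (AA or Aa)
    are kept. Illustrative model only, not the method used in the deck.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    q = 1.0 - p
    carriers = p * p + 2.0 * p * q   # share of AA + Aa plants in the population
    favorable = p * p + p * q        # AA carry two copies, Aa carry one (half weight)
    return favorable / carriers


def simulate(p0: float, generations: int) -> list:
    """Return the allele frequency at generation 0, 1, ..., generations."""
    freqs = [p0]
    for _ in range(generations):
        freqs.append(next_generation_frequency(freqs[-1]))
    return freqs


freqs = simulate(0.1, 5)  # hypothetical starting frequency
for generation, p in enumerate(freqs):
    print(f"generation {generation}: p = {p:.3f}")
```

Under these assumptions the frequency rises every generation without reaching 1, because heterozygous carriers keep passing on the unfavorable allele.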
AI & ML
what you need to know
Machine Learning: Make Machines Learn
Artificial Intelligence: Make Intelligent Machines
programming a computer to be intelligent is hard
programming a computer to learn to be intelligent is easier and progress is measurable
Image understanding is (getting) better than human level
ImageNet Challenge: Given an image, predict one of 1000+ classes
[Chart: % errors]
* Human performance based on analysis done by Andrej Karpathy.
Deep Neural Networks: Algorithms that Learn
● Modernization of artificial neural networks
● Made of simple mathematical units, organized in layers, that together can compute some (arbitrary) function
● More layers = deeper = more general
● Learn from raw, heterogeneous data
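The "simple mathematical units, organized in layers" idea can be sketched in a few lines of plain Python. This is an illustrative forward pass only: the layer sizes, the ReLU and softmax choices, and the random untrained weights are all assumptions made for the example, not details from the slides.

```python
import math
import random


def dense(inputs, weights, biases):
    """One layer: each unit is a weighted sum of the inputs plus a bias."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    """Simple nonlinearity applied unit by unit."""
    return [max(0.0, v) for v in values]


def softmax(values):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def make_layer(n_in, n_out, rng):
    """Random, untrained weights; a real network would learn these from data."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(x, layers):
    """Hidden layers use ReLU; the last layer produces class probabilities."""
    for weights, biases in layers[:-1]:
        x = relu(dense(x, weights, biases))
    weights, biases = layers[-1]
    return softmax(dense(x, weights, biases))


rng = random.Random(0)
layers = [make_layer(4, 8, rng), make_layer(8, 8, rng), make_layer(8, 3, rng)]
probs = forward([0.2, -0.1, 0.7, 0.05], layers)
print(probs)
```

Adding another `make_layer` entry to `layers` makes the network deeper without changing `forward`.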
“Given an image, predict one of 1000+ classes”
Image credit:
360phot0.blogspot.com
ImageNet
Challenge
Released in Nov. 2015
#1 repository for “machine learning” category on GitHub
TensorFlow
Style Transfer
Transfer Learning
Quickly able to Learn New Concepts
“t-rex” “quidditch”
Learning like a Child: Fast Novel Visual Concept Learning from Sentence Descriptions of Images
TensorFlow powered Cucumber Sorter
Genomics & Genetics Problems:
How to Start Applying DNNs?
Must-haves for deep learning:
● Lots of data: >50k examples; >1M examples is ideal
● High-quality input and labels for training
● Label ~ F(data): the function F is unknown, but one certainly exists
● High-quality previous efforts, so we know that DNNs are key
○ i.e. the problem is hard to solve with classical statistical approaches
SNP and indel calling from NGS data
Environments
Phenotypes
Quantifying Genotypes
20170402 Crop Innovation and Business - Amsterdam
Creating a universal SNP and small indel
variant caller with deep neural networks
Ryan Poplin, Cory McLean, Dan Newburger, Jojo Dijamco, Nam Nguyen, Dion Loy,
Sam Gross, Madeleine Cule, Peyton Greenside, Justin Zook, Marc Salit, Mark
DePristo, Verily Life Sciences, October 2016
DNN (Inception V3) Predicts True Genotype from Pileup Images
{ 0.001, 0.994, 0.005 }
{ 0.001, 0.990, 0.009 }
{ 0.000, 0.001, 0.999 }
{ 0.600, 0.399, 0.001 }
Output: Probability of diploid genotype states { HOM_REF, HET, HOM_VAR }
Raw pixels
Input: Millions of labeled pileup images from gold standard samples
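As a small illustration of how such output could be read, the sketch below takes the four probability triples shown on the slide, in the slide's { HOM_REF, HET, HOM_VAR } order, and picks the most probable genotype for each. The helper names and the 0.9 confidence threshold are hypothetical and are not part of DeepVariant.

```python
GENOTYPES = ("HOM_REF", "HET", "HOM_VAR")  # state order as given on the slide


def call_genotype(probs):
    """Return (most likely genotype, its probability) for one site."""
    if len(probs) != len(GENOTYPES):
        raise ValueError("expected one probability per genotype state")
    best = max(range(len(probs)), key=lambda i: probs[i])
    return GENOTYPES[best], probs[best]


def is_confident(prob, threshold=0.9):
    """Hypothetical cutoff for flagging calls that deserve a second look."""
    return prob >= threshold


# The four example outputs shown on the slide.
slide_outputs = [
    (0.001, 0.994, 0.005),
    (0.001, 0.990, 0.009),
    (0.000, 0.001, 0.999),
    (0.600, 0.399, 0.001),
]

calls = [call_genotype(p) for p in slide_outputs]
for genotype, prob in calls:
    flag = "" if is_confident(prob) else "  (low confidence)"
    print(f"{genotype}: {prob:.3f}{flag}")
```

With the assumed threshold, only the last example (0.600 for HOM_REF against 0.399 for HET) is flagged as uncertain.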
DeepVariant #1 in PrecisionFDA Truth Challenge
v2 => v3 truth set
for unblinded
sample
Unblinded =>
blinded sample with
v3 truth set
99.85
99.70
98.91
Genotypes
Phenotypes
Optimizing Environments
Quantifying
&
⬇40% Data Center cooling energy
⬆15% Power Usage Effectiveness (PUE)
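For readers unfamiliar with the metric: PUE is conventionally total facility energy divided by IT equipment energy, so lower is better and 1.0 is the floor. The sketch below shows how a 40% cut in cooling energy moves PUE. The energy figures are made up for illustration and are not Google's; they are not meant to reproduce the 15% figure on the slide.

```python
def pue(it_kwh, cooling_kwh, other_overhead_kwh):
    """Power Usage Effectiveness = total facility energy / IT equipment energy."""
    if it_kwh <= 0:
        raise ValueError("IT energy must be positive")
    return (it_kwh + cooling_kwh + other_overhead_kwh) / it_kwh


# Hypothetical energy use over some period, in kWh.
IT_KWH = 1000.0
COOLING_KWH = 300.0
OTHER_OVERHEAD_KWH = 100.0

before = pue(IT_KWH, COOLING_KWH, OTHER_OVERHEAD_KWH)
after = pue(IT_KWH, COOLING_KWH * (1 - 0.40), OTHER_OVERHEAD_KWH)  # 40% less cooling energy

print(f"PUE before: {before:.2f}, after: {after:.2f}")
```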
Google’s Carbon-Neutral, Self-Optimizing Data Centers
The Dalles, Oregon, USA
anezconsulting.com/precision-agronomy/
Agronometric Integration
● Satellite & UAV Images
● Geological Data
● Meteorological & Sensor Data
● Cultivar Data
● Other GIS Data
● Yield Data
TensorFlow
https://cloudplatform.googleblog.com/2015/11/startup-spotlight-Descartes-Labs-monitors-planet-Earths-resources-with-Google-Compute-Engine.html
Public Datasets Project
https://cloud.google.com/bigquery/public-data/
A public dataset is any dataset that is stored in BigQuery and made available to the general public. This URL lists a
special group of public datasets that Google BigQuery hosts for you to access and integrate into your applications.
Google pays for the storage of these datasets and provides public access to the data via BigQuery. You pay only for the
queries that you perform on the data (the first 1TB per month is free).
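The pricing rule above (storage is free, queries are billed after the first 1TB scanned each month) can be sketched as simple arithmetic. The function names are hypothetical, and the price per TB is deliberately left as a required argument because the slide does not state one.

```python
FREE_TB_PER_MONTH = 1.0  # free monthly query allowance quoted in the text above


def billable_tb(scanned_tb_per_query):
    """Terabytes billed in a month once the free allowance is used up."""
    volumes = list(scanned_tb_per_query)
    if any(tb < 0 for tb in volumes):
        raise ValueError("scanned volume cannot be negative")
    return max(0.0, sum(volumes) - FREE_TB_PER_MONTH)


def estimated_cost(scanned_tb_per_query, price_per_tb):
    """Cost estimate; price_per_tb must come from current pricing, not from here."""
    return billable_tb(scanned_tb_per_query) * price_per_tb


light_month = [0.2, 0.3]          # stays inside the free tier
heavy_month = [0.5, 0.75, 0.25]   # 1.5 TB scanned in total

print(billable_tb(light_month))   # 0.0
print(billable_tb(heavy_month))   # 0.5
```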
Environments
Genotypes
Optimizing Phenotypes
Marker-assisted selection for quantitative traits
“Marker Assisted Selection” & “Quantitative Trait Locus”
Occurrence in Literature is Increasing
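One common way to connect a marker to a quantitative trait is a single-marker regression: code each plant's genotype as the number of copies of one allele (0, 1 or 2) and fit a straight line to the phenotype. The sketch below does this with ordinary least squares. It is a generic textbook technique rather than anything specified in the deck, and the genotype and yield values are invented for illustration.

```python
def marker_effect(genotypes, phenotypes):
    """Least-squares fit of phenotype = intercept + effect * genotype.

    genotypes: allele counts (0, 1 or 2) per plant.
    phenotypes: measured trait values per plant.
    Returns (intercept, effect), where effect is the change per allele copy.
    """
    n = len(genotypes)
    if n != len(phenotypes) or n < 2:
        raise ValueError("need matching lists with at least two plants")
    mean_g = sum(genotypes) / n
    mean_y = sum(phenotypes) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    if sxx == 0:
        raise ValueError("marker is monomorphic; no effect can be estimated")
    sxy = sum((g - mean_g) * (y - mean_y) for g, y in zip(genotypes, phenotypes))
    effect = sxy / sxx
    return mean_y - effect * mean_g, effect


# Invented example: six plants, allele counts and a yield-like trait.
allele_counts = [0, 0, 1, 1, 2, 2]
trait_values = [10.0, 10.4, 11.9, 12.1, 14.1, 13.9]

intercept, effect = marker_effect(allele_counts, trait_values)
print(f"intercept = {intercept:.2f}, effect per allele copy = {effect:.2f}")
```

A real analysis would test many markers and correct for population structure and multiple testing; this sketch only shows the core estimate.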
GraphConnect SF 2015 / Graphs Are Feeding The World, Tim Williamson, Data Scientist, Monsanto
https://www.youtube.com/watch?v=6KEvLURBenM
Percolate Streaming Sequencer Reads for Real-time Model Updates
[Diagram labels: PubSub Queue; Sequencer Reads; Genomics APIs, Docker; Revise Models; Models; Cloud ML; MAB; Enhance; BigQuery]
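The streaming idea on this slide (reads arrive on a queue and the model is revised as they come in) can be sketched without any cloud services. Here Python's in-process `queue.Queue` stands in for the Pub/Sub queue, and a running per-site allele frequency stands in for the model; both substitutions, the class and function names, and the sample messages are assumptions made for illustration.

```python
import queue
from collections import defaultdict


class AlleleFrequencyModel:
    """Running per-site allele frequencies, revised one observation at a time."""

    def __init__(self):
        self._counts = defaultdict(lambda: defaultdict(int))

    def update(self, site, allele):
        self._counts[site][allele] += 1

    def frequency(self, site, allele):
        site_counts = self._counts.get(site, {})
        total = sum(site_counts.values())
        return site_counts.get(allele, 0) / total if total else 0.0


def drain(messages, model):
    """Consume every queued message, revise the model, return the count handled."""
    processed = 0
    while True:
        try:
            site, allele = messages.get_nowait()
        except queue.Empty:
            return processed
        model.update(site, allele)
        processed += 1


# Hypothetical stream of (site, observed allele) messages from a sequencer.
messages = queue.Queue()
for observation in [("chr1:100", "A"), ("chr1:100", "G"),
                    ("chr1:100", "A"), ("chr2:50", "T")]:
    messages.put(observation)

model = AlleleFrequencyModel()
handled = drain(messages, model)
print(handled, model.frequency("chr1:100", "A"))
```

Because the model is updated incrementally, its estimates are usable at any point in the stream rather than only after a batch finishes.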
Google confidential │ Do not distribute
Google is good at handling massive volumes of data
uploads per minute: 300hrs
users: 500M+
search index: 100PB+
query response time: 0.25s
Google can Handle Massive Amounts of Genomic Data
uploads per minute: 300hrs (~6 Maize WGS)
users: 500M+ (>100x US PhDs)
search index: 100PB+ (~1M WGS)
query response time: 0.25s (0.25s)
Percolate Streaming Sequencer Reads for Real-time Model Updates
Who Else Needs This?
[Diagram labels: PubSub Queue; Sequencer Reads; Genomics APIs, Docker; Revise Models; Models; Cloud ML; MAB; Enhance; BigQuery]
New Public Dataset: 1K Cannabis
cloud.google.com/bigquery/public-data/1000-cannabis
Blog Post @ Medium:
DNA Sequencing of 1K Cannabis Strains publicly available in Google BigQuery
Open Source:
https://github.com/allenday/bfx-seq
Revise Models
DNA Reads
Build What’s Next
Thank You!
