A Wager for 2016: How
Software Will Beat Hardware
in Biological Data Analysis
C. Titus Brown
Associate Professor
PHR, School of Veterinary Medicine, UC Davis
This talk on slideshare: slideshare.net/c.titus.brown/
This talk idea started with an argument on the
Internet.
xkcd.com/386/ - “Duty Calls”
https://twitter.com/ctitusbrown/status/535191544119451648
The obligatory slide about abundant
sequencing data.
http://www.genome.gov/sequencingcosts/
Also see: https://biomickwatson.wordpress.com/2015/03/25/the-cost-of-sequencing-is-still-going-down/
Big Sequencing Data and Biology
1) Listen to the physicists: “look, we know how
to analyze data from CERN and Sloan Digital
Sky Survey. Just do what we did.”
2) Listen to the Silicon Valley folk: “Hadoop
and Spark, dude. Just map-reduce it.”
3) Develop custom approaches.
Shotgun sequencing
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of
foolishness
It was the Gest of times, it was the wor
, it was the worst of timZs, it was the
isdom, it was the age of foolisXness
, it was the worVt of times, it was the
mes, it was Ahe age of wisdom, it was th
It was the best of times, it Gas the wor
mes, it was the age of witdom, it was th
isdom, it was tIe age of foolishness
Resequencing analysis
We know a reference genome (specific edition), and
want to find variants (differences - blue) in a
background of errors (red)
The scale of the problem (1)
Lots of data per “book”
• A human genome contains approximately 6
billion bases of DNA.
• Covering the entire genome using random
sampling requires ~150 billion bases of
sequencing
The scale of the problem (2)
Many “editions” in e.g. cancer
If you want to look at 1,000 individual tumor
cells and build an evolutionary history of
changes, you need 150 Gbp per cell: 150 Tbp in total.
The scale of the problem (3)
Many sequencers, many analyses.
• 10,000 sequencers worldwide (?)
• Worldwide sequencing capacity ??, but
~300,000 human genomes in 2014…
• Many research groups, each with their own
question(s) - ~1 million data sets each year?
• Cheap! ~$10-20k for a 100 Gbp data set.
Resequencing analysis
We know a reference genome (specific edition), and
want to find variants (differences - blue) in a
background of errors (red)
Mapping: locate reads in reference
(pass 1)
http://en.wikipedia.org/wiki/File:Mapping_Reads.png
Variant detection after mapping
(pass 2, 3, and 4)
http://www.kenkraaijeveld.nl/genomics/bioinformatics/
The current variant calling approach:
Map reads → convert to binary → sort the binary format by genome position →
"pile up" and call variants → extract reads for tricky bits →
realign/assemble (optional)
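To make the multi-pass shape of this pipeline concrete, here is a minimal sketch that drives the usual command-line tools from Python. It assumes BWA, samtools, and bcftools are installed and the reference is already indexed (bwa index / samtools faidx); exact flags vary by tool version, so treat it as an illustration of the workflow, not a recommended invocation.

import subprocess

def run(cmd):
    # Run one stage; every stage re-reads and re-writes the data on disk,
    # which is where the I/O cost comes from.
    print("+", cmd)
    subprocess.run(cmd, shell=True, check=True)

def call_variants(reference="ref.fa", reads="reads.fq"):
    # Pass 1: map reads and convert to binary (BAM).
    run(f"bwa mem {reference} {reads} | samtools view -b -o aln.bam -")
    # Pass 2: sort the alignments by genome position, then index.
    run("samtools sort -o aln.sorted.bam aln.bam")
    run("samtools index aln.sorted.bam")
    # Passes 3/4: pile up reads at each position and call variants.
    run(f"samtools mpileup -u -g -f {reference} aln.sorted.bam"
        " | bcftools call -m -v -o variants.vcf")
    # (Optional extra pass: extract reads over tricky regions and
    #  realign or locally assemble them; omitted here.)

# Example: call_variants("ref.fa", "reads.fq")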
Current approach: pros and cons
Pros:
• Modular and flexible.
• Open source! Well supported! Mature!
• Some of it parallelizes easily!
Cons:
• 4+ passes across the data
• Very I/O intensive (hence unsuitable for cloud).
Some numbers:
• 1,000 single cells from a tumor ~ 150 Tbp of data.
• HiSeq X10 can do the sequencing in ~3 weeks.
• The variant calling requires ~2,000 CPU weeks…
• …so, given ~2,000 computers, can do this all in
one month.
…but, multiply problem by # of possible patients...
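A quick back-of-the-envelope check on those numbers, using only the figures from this slide and assuming one CPU per computer:

cpu_weeks_needed = 2000      # variant calling, from the slide
computers = 2000             # one CPU-week of work per computer per week
compute_weeks = cpu_weeks_needed / computers   # ~1 week of wall-clock compute
sequencing_weeks = 3                           # HiSeq X10 estimate above
print(sequencing_weeks + compute_weeks)        # ~4 weeks, i.e. about a month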
Big Sequencing Data and Biology
1) Listen to the physicists: “look, we know how to
analyze data from CERN and Sloan Digital Sky
Survey. Just do what we did.”
2) Listen to the Silicon Valley folk: “Hadoop and Spark,
dude. Just map-reduce it.”
3) Develop better custom approaches, swiping ideas
from Silicon Valley and physicists as needed.
So, back to the Internet argument:
it ended with a bet.
In two years (Nov 2016), my 9-year-old daughter
will be able to analyze a full human genome
sequence on her desktop computer.
https://twitter.com/ctitusbrown/status/535191544119451648
“Never compete unless you have an unfair advantage.”
1. My daughter is awesome.
2. We know how to do it
already*
(* some assembly required)
3. Heng Li just posted a
preprint yesterday!
“FermiKit”, http://arxiv.org/abs/1504.06574
Remainder of talk – outline.
1. “Data” vs “information”
2. Streaming approaches to lossy compression
and building compressible graphs for soil
metagenomics.
3. Sequencing errors and variants using graphs.
Conway TC, Bromage AJ. Bioinformatics 2011;27:479-486.
De Bruijn graphs (sequencing graphs) scale with
data size, not information size.
Why do sequence graphs scale
badly?
Memory usage ~ “real” variation + number of errors
Number of errors ~ size of data set
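A toy simulation makes this concrete (just a sketch with made-up numbers, not khmer or Velvet): k-mers contributed by the genome saturate once every position is covered, while k-mers created by sequencing errors keep accumulating with every extra read.

import random

random.seed(1)
K = 21
genome = "".join(random.choice("ACGT") for _ in range(10_000))
read_len = 100

def kmers(seq):
    return (seq[i:i + K] for i in range(len(seq) - K + 1))

seen = set()
for n_reads in range(1, 5001):
    start = random.randrange(len(genome) - read_len)
    read = list(genome[start:start + read_len])
    pos = random.randrange(read_len)                  # one error per read
    read[pos] = random.choice("ACGT".replace(read[pos], ""))
    seen.update(kmers("".join(read)))
    if n_reads % 1000 == 0:
        coverage = n_reads * read_len / len(genome)
        print(f"{coverage:.0f}x coverage: {len(seen)} distinct k-mers")

Genomic k-mers plateau at roughly the genome size; the distinct k-mer count keeps climbing anyway because the error k-mers scale with the number of reads.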
Practical memory measurements
Velvet measurements (Adina Howe)
Our solution: lossy compression
Conway TC, Bromage AJ. Bioinformatics 2011;27:479-486.
Shotgun sequencing and coverage
“Coverage” is simply the average number of reads that overlap
each true base in the genome.
Here, the coverage is ~10 – just draw a line straight down from the top
through all of the reads.
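That definition gives the coverage arithmetic directly; with the numbers from the earlier slides:

# Coverage = total bases sequenced / genome size.
genome_size = 6e9         # ~6 billion bases (diploid human genome)
bases_sequenced = 150e9   # ~150 Gbp of shotgun reads
print(bases_sequenced / genome_size)   # 25.0 -> ~25x average coverage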
Random sampling => deep sampling needed
Typically 10-100x needed for robust recovery (30-300 Gbp for human)
Actual coverage varies widely from the average.
Low coverage introduces unavoidable breaks.
But! Shotgun sequencing is very redundant!
Lots of the high coverage simply isn’t needed.
(unnecessary data)
Digital normalization
(a series of figure-only slides stepping through digital normalization of a read stream)
Graph sizes now scale with information content.
Most samples can be reconstructed via de
novo assembly on commodity computers.
Diginorm ~ “lossy compression”
Nearly perfect from an information theoretic
perspective:
– Discards 95% or more of the data for genomes.
– Loses < 0.02% of the information.
This changes the way analyses
scale.
Conway TC, Bromage AJ. Bioinformatics 2011;27:479-486.
Streaming lossy compression:
for read in dataset:
    if estimated_coverage(read) < CUTOFF:
        yield read
This is literally a three line algorithm. Not kidding.
It took four years to figure out which three lines, though…
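The only non-obvious piece is estimated_coverage(). A minimal, self-contained sketch of the idea: use the median abundance of a read's k-mers, as counted so far, as its coverage estimate. (khmer uses a fixed-memory probabilistic counter; a plain dictionary stands in for it here, and the k-mer size and cutoff are illustrative.)

from statistics import median

K = 20        # k-mer size (illustrative)
CUTOFF = 20   # target coverage (illustrative)
counts = {}   # stand-in for khmer's fixed-memory probabilistic counter

def kmers(seq):
    return [seq[i:i + K] for i in range(len(seq) - K + 1)]

def estimated_coverage(read):
    # Median abundance of the read's k-mers so far; assumes len(read) > K.
    return median(counts.get(km, 0) for km in kmers(read))

def digital_normalization(reads):
    for read in reads:
        if estimated_coverage(read) < CUTOFF:
            for km in kmers(read):          # only kept reads update the counts
                counts[km] = counts.get(km, 0) + 1
            yield read                      # this read still adds information
        # else: drop it; its region of the genome is already well covered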
Diginorm can detect information
saturation in a stream.
Zhang et al., submitted.
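One way to picture the saturation test (a sketch of the idea, not the implementation in Zhang et al.): track what fraction of recent reads digital normalization keeps. When nearly every incoming read is redundant, the stream is saturated and downstream work can stop or switch modes. The window size and threshold below are illustrative.

from collections import deque

WINDOW = 10000            # reads per sliding window (illustrative)
SATURATED_BELOW = 0.05    # keep-rate threshold (illustrative)

def is_saturated(reads, keep_read):
    # keep_read(read) -> True if diginorm kept the read, False if discarded.
    recent = deque(maxlen=WINDOW)
    for read in reads:
        recent.append(keep_read(read))
        if len(recent) == WINDOW and sum(recent) / WINDOW < SATURATED_BELOW:
            return True   # almost nothing new is arriving: saturated
    return False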
This generically permits semi-streaming
analytical approaches.
Zhang et al., submitted.
e.g. E. coli analysis => ~1.2 pass, sublinear
memory
Zhang et al., submitted.
Another simple algorithm.
Zhang et al., submitted.
Single pass, reference free, tunable, streaming online
variant calling.
Error detection → variant calling
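A simplified sketch of the underlying idea (the thresholds and the counting structure are illustrative, not the method from the paper): in a read drawn from a saturated, high-coverage region, a k-mer seen only once or twice is most likely a sequencing error, while a k-mer seen at a substantial fraction of the local coverage is a candidate variant.

def classify_positions(read, counts, k=21, error_max=2, local_coverage=50):
    # counts: k-mer -> abundance observed so far in the stream.
    calls = []
    for i in range(len(read) - k + 1):
        abundance = counts.get(read[i:i + k], 0)
        if abundance <= error_max:
            calls.append((i, "likely sequencing error"))
        elif abundance < 0.8 * local_coverage:
            calls.append((i, "candidate variant"))
        else:
            calls.append((i, "matches the dominant sequence"))
    return calls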
Real time / streaming data analysis:
Raw data (real time, from the sequencer?) → error trimming → variant calling →
de novo assembly
Stream all the things!
This code works.
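“Stream all the things” can be read quite literally: each stage is a generator that consumes the previous one, so a read is processed as soon as it arrives and nothing is written back to disk between stages. A toy sketch of the composition (the stage bodies are placeholders, not the real khmer-based stages):

def error_trim(reads):
    for read in reads:
        yield read            # placeholder: trim low-abundance (error) k-mers

def call_variants(reads):
    for read in reads:
        yield read            # placeholder: compare read k-mers to the graph

def streaming_pipeline(raw_reads):
    # raw_reads can be a file, or a live feed from the sequencer.
    for read in call_variants(error_trim(raw_reads)):
        yield read            # hand off to de novo assembly downstream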
Preliminary benchmarks -
• Can do variant calling on E. coli in about 5
minutes, in 40 MB of RAM, with a single
thread, with no optimization.
• Scaling to human should be readily feasible.
• …I have another 18 months before I lose the
bet.
My real point -
• We need well-founded, flexible, algorithmically
efficient, high-performance components for
sequence data manipulation in biology.
• We are building these on top of a streaming and low
memory paradigm.
• We are building out a scripting library for composing
these operations.
Scaling compute, or algorithms?
There are some problems that require big computers &
many processors.
Genomic data analysis shouldn’t be one of them, based
on information content alone!
(This is probably good, given the scale of the need.)
Many other biological problems do require big compute,
however.
Reminder: the real challenge is
understanding
We have gotten distracted by shiny toys: sequencing!!
Data!!
Data is now plentiful! But:
We typically have no knowledge of what >50% of,
say, an environmental metagenome “means”,
functionally.
http://ivory.idyll.org/blog/2014-function-of-unknown-genes.html
I was going to give you my 5 year
vision…
…but I don’t have 20/20 eyesight.
(20/20? 2020? 2015 + 5?)
(My wife has asked that I apologize for this
joke.)
Via @adrianholovaty
Data integration as a next
challenge
In 5-10 years, we will have nigh-infinite data.
(Genomic, transcriptomic, proteomic, metabolomic,
…?)
How do we explore these data sets?
Registration, cross-validation, integration with
models…
Carbon cycling in the ocean -
“DeepDOM” cruise, Kujawinski & Longnecker et al.
Integrating many different data types to
build understanding.
Figure 2. Summary of challenges associated with data integration in the proposed project.
“DeepDOM” cruise: examination of dissolved organic matter & microbial
metabolism vs. physical parameters – potential collaboration.
Data/analysis lifecycle
A few thoughts on practical next
steps.
• Enable scientists with better tools.
• Train a bioinformatics “middle class.”
• Accelerate science via the open science “network
effect”.
That is… what do we do now?
Once you have all this data, what do you do?
"Business as usual simply cannot work.”
- David Haussler, 2014
Looking at millions to billions of (human) genomes in
the next 5-10 years.
Enabling scientists with better tools -
Build robust, flexible computational
frameworks for data exploration, and make
them open and remixable.
Develop theory, algorithms, & software
together, and train people in their use.
(Stop pretending that we can develop “black
boxes” that will give you the right answer.)
Education and training - towards a
bioinformatics “middle class”
Biology is underprepared for data-intensive investigation.
We must teach and train the next generations.
=> Build a cohort of “data intensive biologists” who can use
data and tools as an intrinsic and unremarkable part of their
research.
~10-20 workshops / year, novice -> masterclass; open
materials.
dib-training.rtfd.org/
Can open science trigger a
“network effect”?
http://prasoondiwakar.com/wordpress/trivia/the-network-effect
So: can we drive data sharing via a decentralized
model, e.g. a distributed graph database?
(Architecture sketch: a graph query layer spanning public servers, “walled
garden” servers, and private servers holding raw data sets; a web interface +
API and compute servers (Galaxy? Arvados?) sit on top, with upload/submit
paths (NCBI, KBase) and import paths (MG-RAST, SRA, EBI).)
ivory.idyll.org/blog/2014-moore-ddd-award.html
My larger research vision:
100% buzzword compliant™
Enable and incentivize sharing by providing immediate utility;
frictionless sharing.
Permissionless innovation for e.g. new data mining
approaches.
Plan for poverty with federated infrastructure built on open &
cloud.
Solve people’s current problems, while remaining agile for
the future.
ivory.idyll.org/blog/2014-moore-ddd-award.html
Thanks!
Please contact me at ctbrown@ucdavis.edu!

Editor's Notes

  1. A sketch showing the relationship between the number of sequence reads and the number of edges in the graph. Because the underlying genome is fixed in size, as the number of sequence reads increases, the number of edges due to the underlying genome plateaus once every part of the genome is covered. Conversely, since errors tend to be random and more or less unique, their number scales linearly with the number of sequence reads. Once enough reads are present to clearly distinguish true edges (which come from the underlying genome), the true edges will usually be outnumbered by spurious edges (which arise from errors) by a substantial factor.
  2. High coverage is essential.
  3. High coverage is essential.
  4. Goal is to do first stage data reduction/analysis in less time than it takes to generate the data. Compression => OLC assembly.
  5. Taking advantage of structure within read
  6. Passionate about training; necessary for the advancement of the field; also deeply self-interested, because I find out what the real problems are. (“Some people can do assembly” is not “everyone can do assembly.”)
  7. Analyze data in cloud; import and export important; connect to other databases.
  8. Work with other Moore DDD folk on the data mining aspect. Start with cross validation, move to more sophisticated in-server implementations.