A Wager for 2016: How
Software Will Beat Hardware
in Biological Data Analysis
C. Titus Brown
Associate Professor
PHR, School of Veterinary Medicine, UC Davis
This talk on slideshare: slideshare.net/c.titus.brown/
This talk idea started with an argument on the
Internet.
xkcd.com/386/ - “Duty Calls”
https://twitter.com/ctitusbrown/status/535191544119451648
The obligatory slide about abundant
sequencing data.
http://www.genome.gov/sequencingcosts/
Also see: https://biomickwatson.wordpress.com/2015/03/25/the-cost-of-sequencing-is-still-going-down/
Big Sequencing Data and Biology
1) Listen to the physicists: “look, we know how
to analyze data from CERN and Sloan Digital
Sky Survey. Just do what we did.”
2) Listen to the Silicon Valley folk: “Hadoop
and Spark, dude. Just map-reduce it.”
3) Develop custom approaches.
Shotgun sequencing
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of
foolishness
It was the Gest of times, it was the wor
, it was the worst of timZs, it was the
isdom, it was the age of foolisXness
, it was the worVt of times, it was the
mes, it was Ahe age of wisdom, it was th
It was the best of times, it Gas the wor
mes, it was the age of witdom, it was th
isdom, it was tIe age of foolishness
Resequencing analysis
We know a reference genome (specific edition), and
want to find variants (differences - blue) in a
background of errors (red)
The scale of the problem (1)
Lots of data per “book”
• A human genome contains approximately 6
billion bases of DNA.
• Covering the entire genome using random
sampling requires ~150 billion bases of
sequencing
The scale of the problem (2)
Many “editions” in e.g. cancer
If you want to look at 1,000 individual tumor
cells and build an evolutionary history of
changes, you need 150 Gbp per cell: 150 Tbp in total.
The scale of the problem (3)
Many sequencers, many analyses.
• 10,000 sequencers worldwide (?)
• Worldwide sequencing capacity ??, but
~300,000 human genomes in 2014…
• Many research groups, each with their own
question(s) - ~1 million data sets each year?
• Cheap! ~$10-20k for a 100 Gbp data set.
Resequencing analysis
We know a reference genome (specific edition), and
want to find variants (differences - blue) in a
background of errors (red)
Mapping: locate reads in reference
(pass 1)
http://en.wikipedia.org/wiki/File:Mapping_Reads.png
Variant detection after mapping
(pass 2, 3, and 4)
http://www.kenkraaijeveld.nl/genomics/bioinformatics/
The current variant calling approach:
Map reads → convert to binary → sort the binary format by genome position →
"pile up" and call variants → extract reads for tricky bits →
realign/assemble (optional)
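To make the multi-pass shape of this pipeline concrete, here is a minimal sketch that drives the usual command-line tools from Python. It assumes BWA, samtools, and bcftools are installed and the reference is already indexed (bwa index / samtools faidx); exact flags vary by tool version, so treat it as an illustration of the workflow, not a recommended invocation.

import subprocess

def run(cmd):
    # Run one stage; every stage re-reads and re-writes the data on disk,
    # which is where the I/O cost comes from.
    print("+", cmd)
    subprocess.run(cmd, shell=True, check=True)

def call_variants(reference="ref.fa", reads="reads.fq"):
    # Pass 1: map reads and convert to binary (BAM).
    run(f"bwa mem {reference} {reads} | samtools view -b -o aln.bam -")
    # Pass 2: sort the alignments by genome position, then index.
    run("samtools sort -o aln.sorted.bam aln.bam")
    run("samtools index aln.sorted.bam")
    # Passes 3/4: pile up reads at each position and call variants.
    run(f"samtools mpileup -u -g -f {reference} aln.sorted.bam"
        " | bcftools call -m -v -o variants.vcf")
    # (Optional extra pass: extract reads over tricky regions and
    #  realign or locally assemble them; omitted here.)

# Example: call_variants("ref.fa", "reads.fq")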
Current approach: pros and cons
Pros:
• Modular and flexible.
• Open source! Well supported! Mature!
• Some of it parallelizes easily!
Cons:
• 4+ passes across the data
• Very I/O intensive (hence unsuitable for cloud).
Some numbers:
• 1,000 single cells from a tumor ~ 150 Tbp of data.
• HiSeq X10 can do the sequencing in ~3 weeks.
• The variant calling requires ~2,000 CPU weeks…
• …so, given ~2,000 computers, can do this all in
one month.
…but, multiply problem by # of possible patients...
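A quick back-of-the-envelope check on those numbers, using only the figures from this slide and assuming one CPU per computer:

cpu_weeks_needed = 2000      # variant calling, from the slide
computers = 2000             # one CPU-week of work per computer per week
compute_weeks = cpu_weeks_needed / computers   # ~1 week of wall-clock compute
sequencing_weeks = 3                           # HiSeq X10 estimate above
print(sequencing_weeks + compute_weeks)        # ~4 weeks, i.e. about a month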
Big Sequencing Data and Biology
1) Listen to the physicists: “look, we know how to
analyze data from CERN and Sloan Digital Sky
Survey. Just do what we did.”
2) Listen to the Silicon Valley folk: “Hadoop and Spark,
dude. Just map-reduce it.”
3) Develop better custom approaches, swiping ideas
from Silicon Valley and physicists as needed.
So, back to the Internet argument:
it ended with a bet.
In two years (Nov 2016), my 9-year-old daughter
will be able to analyze a full human genome
sequence on her desktop computer.
https://twitter.com/ctitusbrown/status/535191544119451648
“Never compete unless you have an unfair advantage.”
1. My daughter is awesome.
2. We know how to do it
already*
(* some assembly required)
3. Heng Li just posted a
preprint yesterday!
“FermiKit”, http://arxiv.org/abs/1504.06574
Remainder of talk – outline.
1. “Data” vs “information”
2. Streaming approaches to lossy compression
and building compressible graphs for soil
metagenomics.
3. Sequencing errors and variants using graphs.
Conway TC, Bromage AJ. Bioinformatics 2011;27:479-486.
De Bruijn graphs (sequencing graphs) scale with
data size, not information size.
Why do sequence graphs scale
badly?
Memory usage ~ “real” variation + number of errors
Number of errors ~ size of data set
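A toy simulation makes this concrete (just a sketch with made-up numbers, not khmer or Velvet): k-mers contributed by the genome saturate once every position is covered, while k-mers created by sequencing errors keep accumulating with every extra read.

import random

random.seed(1)
K = 21
genome = "".join(random.choice("ACGT") for _ in range(10_000))
read_len = 100

def kmers(seq):
    return (seq[i:i + K] for i in range(len(seq) - K + 1))

seen = set()
for n_reads in range(1, 5001):
    start = random.randrange(len(genome) - read_len)
    read = list(genome[start:start + read_len])
    pos = random.randrange(read_len)                  # one error per read
    read[pos] = random.choice("ACGT".replace(read[pos], ""))
    seen.update(kmers("".join(read)))
    if n_reads % 1000 == 0:
        coverage = n_reads * read_len / len(genome)
        print(f"{coverage:.0f}x coverage: {len(seen)} distinct k-mers")

Genomic k-mers plateau at roughly the genome size; the distinct k-mer count keeps climbing anyway because the error k-mers scale with the number of reads.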
Practical memory measurements
Velvet measurements (Adina Howe)
Our solution: lossy compression
Conway TC, Bromage AJ. Bioinformatics 2011;27:479-486.
Shotgun sequencing and coverage
“Coverage” is simply the average number of reads that overlap
each true base in the genome.
Here, the coverage is ~10 – just draw a line straight down from the top
through all of the reads.
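That definition gives the coverage arithmetic directly; with the numbers from the earlier slides:

# Coverage = total bases sequenced / genome size.
genome_size = 6e9         # ~6 billion bases (diploid human genome)
bases_sequenced = 150e9   # ~150 Gbp of shotgun reads
print(bases_sequenced / genome_size)   # 25.0 -> ~25x average coverage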
Random sampling => deep sampling needed
Typically 10-100x needed for robust recovery (30-300 Gbp for human)
Actual coverage varies widely from the average.
Low coverage introduces unavoidable breaks.
But! Shotgun sequencing is very redundant!
Lots of the high coverage simply isn’t needed.
(unnecessary data)
Digital normalization
(a series of figure-only slides stepping through digital normalization of a read stream)
Graph sizes now scale with information content.
Most samples can be reconstructed via de
novo assembly on commodity computers.
Diginorm ~ “lossy compression”
Nearly perfect from an information theoretic
perspective:
– Discards 95% or more of the data for genomes.
– Loses < 0.02% of the information.
This changes the way analyses
scale.
Conway TC, Bromage AJ. Bioinformatics 2011;27:479-486.
Streaming lossy compression:
for read in dataset:
    if estimated_coverage(read) < CUTOFF:
        yield read
This is literally a three line algorithm. Not kidding.
It took four years to figure out which three lines, though…
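The only non-obvious piece is estimated_coverage(). A minimal, self-contained sketch of the idea: use the median abundance of a read's k-mers, as counted so far, as its coverage estimate. (khmer uses a fixed-memory probabilistic counter; a plain dictionary stands in for it here, and the k-mer size and cutoff are illustrative.)

from statistics import median

K = 20        # k-mer size (illustrative)
CUTOFF = 20   # target coverage (illustrative)
counts = {}   # stand-in for khmer's fixed-memory probabilistic counter

def kmers(seq):
    return [seq[i:i + K] for i in range(len(seq) - K + 1)]

def estimated_coverage(read):
    # Median abundance of the read's k-mers so far; assumes len(read) > K.
    return median(counts.get(km, 0) for km in kmers(read))

def digital_normalization(reads):
    for read in reads:
        if estimated_coverage(read) < CUTOFF:
            for km in kmers(read):          # only kept reads update the counts
                counts[km] = counts.get(km, 0) + 1
            yield read                      # this read still adds information
        # else: drop it; its region of the genome is already well covered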
Diginorm can detect information
saturation in a stream.
Zhang et al., submitted.
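One way to picture the saturation test (a sketch of the idea, not the implementation in Zhang et al.): track what fraction of recent reads digital normalization keeps. When nearly every incoming read is redundant, the stream is saturated and downstream work can stop or switch modes. The window size and threshold below are illustrative.

from collections import deque

WINDOW = 10000            # reads per sliding window (illustrative)
SATURATED_BELOW = 0.05    # keep-rate threshold (illustrative)

def is_saturated(reads, keep_read):
    # keep_read(read) -> True if diginorm kept the read, False if discarded.
    recent = deque(maxlen=WINDOW)
    for read in reads:
        recent.append(keep_read(read))
        if len(recent) == WINDOW and sum(recent) / WINDOW < SATURATED_BELOW:
            return True   # almost nothing new is arriving: saturated
    return False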
This generically permits semi-streaming
analytical approaches.
Zhang et al., submitted.
e.g. E. coli analysis => ~1.2 pass, sublinear
memory
Zhang et al., submitted.
Another simple algorithm.
Zhang et al., submitted.
Single pass, reference free, tunable, streaming online
variant calling.
Error detection → variant calling
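A simplified sketch of the underlying idea (the thresholds and the counting structure are illustrative, not the method from the paper): in a read drawn from a saturated, high-coverage region, a k-mer seen only once or twice is most likely a sequencing error, while a k-mer seen at a substantial fraction of the local coverage is a candidate variant.

def classify_positions(read, counts, k=21, error_max=2, local_coverage=50):
    # counts: k-mer -> abundance observed so far in the stream.
    calls = []
    for i in range(len(read) - k + 1):
        abundance = counts.get(read[i:i + k], 0)
        if abundance <= error_max:
            calls.append((i, "likely sequencing error"))
        elif abundance < 0.8 * local_coverage:
            calls.append((i, "candidate variant"))
        else:
            calls.append((i, "matches the dominant sequence"))
    return calls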
Real time / streaming data analysis:
Raw data (real time, from the sequencer?) → error trimming → variant calling →
de novo assembly
Stream all the things!
This code works.
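“Stream all the things” can be read quite literally: each stage is a generator that consumes the previous one, so a read is processed as soon as it arrives and nothing is written back to disk between stages. A toy sketch of the composition (the stage bodies are placeholders, not the real khmer-based stages):

def error_trim(reads):
    for read in reads:
        yield read            # placeholder: trim low-abundance (error) k-mers

def call_variants(reads):
    for read in reads:
        yield read            # placeholder: compare read k-mers to the graph

def streaming_pipeline(raw_reads):
    # raw_reads can be a file, or a live feed from the sequencer.
    for read in call_variants(error_trim(raw_reads)):
        yield read            # hand off to de novo assembly downstream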
Preliminary benchmarks -
• Can do variant calling on E. coli in about 5
minutes, in 40 MB of RAM, with a single
thread, with no optimization.
• Scaling to human should be readily feasible.
• …I have another 18 months before I lose the
bet.
My real point -
• We need well-founded, flexible, algorithmically
efficient, high-performance components for
sequence data manipulation in biology.
• We are building these on top of a streaming and low
memory paradigm.
• We are building out a scripting library for composing
these operations.
Scaling compute, or algorithms?
There are some problems that require big computers &
many processors.
Genomic data analysis shouldn’t be one of them, based
on information content alone!
(This is probably good, given the scale of the need.)
Many other biological problems do require big compute,
however.
Reminder: the real challenge is
understanding
We have gotten distracted by shiny toys: sequencing!!
Data!!
Data is now plentiful! But:
We typically have no knowledge of what >50% of,
say, an environmental metagenome “means”,
functionally.
http://ivory.idyll.org/blog/2014-function-of-unknown-genes.html
I was going to give you my 5 year
vision…
…but I don’t have 20/20 eyesight.
(20/20? 2020? 2015 + 5?)
(My wife has asked that I apologize for this
joke.)
Via @adrianholovaty
Data integration as a next
challenge
In 5-10 years, we will have nigh-infinite data.
(Genomic, transcriptomic, proteomic, metabolomic,
…?)
How do we explore these data sets?
Registration, cross-validation, integration with
models…
Carbon cycling in the ocean -
“DeepDOM” cruise, Kujawinski & Longnecker et al.
Integrating many different data types to
build understanding.
Figure 2. Summary of challenges associated with data integration in the proposed project.
“DeepDOM” cruise: examination of dissolved organic matter & microbial
metabolism vs. physical parameters – potential collaboration.
Data/analysis lifecycle
A few thoughts on practical next
steps.
• Enable scientists with better tools.
• Train a bioinformatics “middle class.”
• Accelerate science via the open science “network
effect”.
That is… what do we do now?
Once you have all this data, what do you do?
"Business as usual simply cannot work.”
- David Haussler, 2014
Looking at millions to billions of (human) genomes in
the next 5-10 years.
Enabling scientists with better tools -
Build robust, flexible computational
frameworks for data exploration, and make
them open and remixable.
Develop theory, algorithms, & software
together, and train people in their use.
(Stop pretending that we can develop “black
boxes” that will give you the right answer.)
Education and training - towards a
bioinformatics “middle class”
Biology is underprepared for data-intensive investigation.
We must teach and train the next generations.
=> Build a cohort of “data intensive biologists” who can use
data and tools as an intrinsic and unremarkable part of their
research.
~10-20 workshops / year, novice -> masterclass; open
materials.
dib-training.rtfd.org/
Can open science trigger a
“network effect”?
http://prasoondiwakar.com/wordpress/trivia/the-network-effect
So: can we drive data sharing via a decentralized
model, e.g. a distributed graph database?
(Architecture sketch: a graph query layer spanning public servers, “walled
garden” servers, and private servers holding raw data sets; a web interface +
API and compute servers (Galaxy? Arvados?) sit on top, with upload/submit
paths (NCBI, KBase) and import paths (MG-RAST, SRA, EBI).)
ivory.idyll.org/blog/2014-moore-ddd-award.html
My larger research vision:
100% buzzword compliant™
Enable and incentivize sharing by providing immediate utility;
frictionless sharing.
Permissionless innovation for e.g. new data mining
approaches.
Plan for poverty with federated infrastructure built on open &
cloud.
Solve people’s current problems, while remaining agile for
the future.
ivory.idyll.org/blog/2014-moore-ddd-award.html
Thanks!
Please contact me at ctbrown@ucdavis.edu!

Editor's Notes

  1. A sketch showing the relationship between the number of sequence reads and the number of edges in the graph. Because the underlying genome is fixed in size, as the number of sequence reads increases, the number of edges due to the underlying genome plateaus once every part of the genome is covered. Conversely, since errors tend to be random and more or less unique, their number scales linearly with the number of sequence reads. Once enough reads are present to clearly distinguish true edges (which come from the underlying genome), the true edges will usually be outnumbered by spurious edges (which arise from errors) by a substantial factor.
  2. High coverage is essential.
  3. High coverage is essential.
  4. Goal is to do first stage data reduction/analysis in less time than it takes to generate the data. Compression => OLC assembly.
  5. Taking advantage of structure within read
  6. Passionate about training; necessary for the advancement of the field; also deeply self-interested, because I find out what the real problems are. (“Some people can do assembly” is not “everyone can do assembly.”)
  7. Analyze data in cloud; import and export important; connect to other databases.
  8. Work with other Moore DDD folk on the data mining aspect. Start with cross validation, move to more sophisticated in-server implementations.