Memory- and time-efficient approaches to sequence analysis with streaming algorithms
C. Titus Brown
ctb@msu.edu
Part I: Digital normalization
Problem: De Bruijn assembly graphs scale with data size, not information.
Conway TC, Bromage AJ. Bioinformatics 2011;27:479-486.
This is the effect of errors: single nucleotide variations cause long branches; they don't rejoin quickly.
Can we change this scaling behavior? 
Conway TC, Bromage AJ. Bioinformatics 2011;27:479-486.
An apparent digression: much of next-gen sequencing is redundant.
Shotgun sequencing and coverage
"Coverage" is simply the average number of reads that overlap each true base in the genome.
Here, the coverage is ~10: just draw a line straight down from the top through all of the reads.
Random sampling => deep sampling needed
Typically 10-100x coverage is needed for robust recovery (300 Gbp for a human genome).
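As a back-of-the-envelope check (my arithmetic, not a slide): expected coverage is C = N x L / G for N reads of length L over a genome of size G, so the 100x target over the ~3 Gbp human genome implies the ~300 Gbp figure above. A quick sketch, with an assumed read length:

    # Quick coverage arithmetic. The ~3 Gbp genome size and 100x target
    # are from the slides; the 100 bp read length is an assumption.
    genome_size = 3e9          # bp, human
    target_coverage = 100      # x
    read_length = 100          # bp, typical Illumina read (assumed)

    total_bases = target_coverage * genome_size
    num_reads = total_bases / read_length
    print(f"{total_bases / 1e9:.0f} Gbp of sequence, "
          f"~{num_reads / 1e9:.1f} billion reads")
    # -> 300 Gbp of sequence, ~3.0 billion reads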
An apparent digression: much of next-gen sequencing is redundant. Can we eliminate this redundancy?
Digital normalization
[Series of figures stepping through the digital normalization process.]
Basic diginorm algorithm
We can build the approach on anything that lets us estimate the coverage of a read.

    for read in dataset:
        if estimated_coverage(read) < CUTOFF:
            update_kmer_counts(read)
            save(read)
        else:
            pass  # discard read

Note: single pass; sublinear memory.
The median k-mer count in a "sentence" is a ~good estimator of coverage.
This gives us a reference-free measure of coverage.
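To make this concrete, here is a minimal runnable sketch of the loop above (mine, not the khmer implementation; khmer keeps memory sublinear with a fixed-size probabilistic counting structure, while the plain dict here is for clarity only):

    from collections import defaultdict
    from statistics import median

    K = 20
    CUTOFF = 20
    kmer_counts = defaultdict(int)

    def kmers(read):
        return [read[i:i + K] for i in range(len(read) - K + 1)]

    def estimated_coverage(read):
        # Median k-mer count: robust to the few low-count k-mers a
        # single sequencing error introduces into a read.
        return median(kmer_counts.get(km, 0) for km in kmers(read))

    def digital_normalization(dataset):
        for read in dataset:
            if len(read) < K:
                continue  # too short to estimate coverage
            if estimated_coverage(read) < CUTOFF:
                for km in kmers(read):
                    kmer_counts[km] += 1
                yield read  # keep ("save") the read
            # else: discard the read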
Digital normalization is streaming
[Series of figures: reads are processed one at a time, in a single pass.]
Digital normalization retains information, while discarding data and errors.
Digital normalization is streaming error correction.
Contig assembly now scales with underlying genome size.
Transcriptomes, microbial genomes (including MDA), and most metagenomes can be assembled in under 50 GB of RAM, with ~identical or improved results.
Victory! (?) 
Conway TC, Bromage AJ. Bioinformatics 2011;27:479-486.
A few “minor” drawbacks… 
1. Repeats are eliminated preferentially. 
2. Genuine graph tips are truncated. 
3. Polyploidy is downsampled. 
4. It’s not clear what happens to polymorphism. 
(For these reasons, we have been pursuing alternate approaches.)
Partially discussed in Brown et al., 2012 (arXiv).
But still quite useful… 
1. Assembling soil metagenomes. 
Howe et al., PNAS, 2014 (w/Tiedje) 
2. Understanding bone-eating worm symbionts. 
Goffredi et al., ISME, 2014. 
3. An ultra-deep look at the lamprey transcriptome. 
Scott et al., in preparation (w/Li) 
4. Understanding development in molgulid ascidians.
Stolfi et al., eLife 2014; etc.
…and widely used (?)
Estimated ~1,000 users of our software.
The diginorm algorithm is now included in the Trinity software from the Broad Institute (~10,000 users).
Illumina TruSeq long-read technology now incorporates our approach (~100,000 users).
Part II: Wait, did you say streaming?
Diginorm can detect graph saturation
Graph saturation

    for read in dataset:
        if estimated_coverage(read) < CUTOFF:
            update_kmer_counts(read)
            save(read)
        else:
            pass  # high-coverage read: do something clever!
"Few-pass" approach
By 20% of the way through a 100x data set, more than half the reads are saturated to 20x.
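One hedged sketch of what detecting saturation could look like in code (my illustration, reusing kmers/estimated_coverage/kmer_counts/CUTOFF from the diginorm sketch above; WINDOW and THRESHOLD are assumed parameters, not from the talk):

    # Track the fraction of recent reads that are already high coverage;
    # once most incoming reads are saturated, downstream logic can switch
    # to the "do something clever" branch.
    from collections import deque

    WINDOW = 10000    # recent reads to consider (assumed)
    THRESHOLD = 0.5   # "saturated" when >50% of recent reads are high coverage

    recent = deque(maxlen=WINDOW)
    kept = []

    def process(read):
        high = estimated_coverage(read) >= CUTOFF
        recent.append(high)
        if not high:
            for km in kmers(read):
                kmer_counts[km] += 1
            kept.append(read)   # save(read)
        return sum(recent) / len(recent) >= THRESHOLD  # True once saturated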
(A) Streaming error detection for metagenomes and transcriptomes
• Illumina has an error rate between 0.1% and 1%.
• These errors confound mapping, assembly, etc.
(Think: what if you had error-free reads? Life would be much better.)
Spectral error detection for genomes (Chaisson et al., 2009)
[Figure: k-mer abundance spectrum; high-abundance true k-mers separate from low-abundance erroneous k-mers.]
Spectral error detection on reads: error location!
[Figure: the run of low-count k-mers within a read pinpoints where the error is.]
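A hedged sketch of the idea (mine, not the khmer code; it reuses K and kmer_counts from the diginorm sketch above, and LOW is an assumed cutoff): an error at base b corrupts every k-mer covering b, so the run of low-count k-mers brackets the error.

    LOW = 3  # k-mer counts below this are treated as erroneous (assumed)

    def low_count_run(read):
        """Return (start, end) of the first run of low-count k-mers, or None."""
        flags = [kmer_counts.get(read[i:i + K], 0) < LOW
                 for i in range(len(read) - K + 1)]
        if True not in flags:
            return None
        start = flags.index(True)
        end = start
        while end + 1 < len(flags) and flags[end + 1]:
            end += 1
        return start, end

    def locate_error(read):
        run = low_count_run(read)
        if run is None:
            return None  # read looks error-free
        start, end = run
        # Every low k-mer must cover the erroneous base, so it lies in
        # read positions [end, start + K - 1]; for a single error away
        # from the read ends, this interval collapses to one base.
        return end, start + K - 1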
…spectral error detection for reads extends to transcriptomes and metagenomes.
[Figure: true vs. erroneous k-mers under variable coverage (after Chaisson et al., 2009).]
Spectral error detection on variable coverage data
How many of the errors can we pinpoint exactly?

                    f saturated   Specificity   Sensitivity
    Genome          100%          71.4%         77.9%
    Transcriptome    92%          67.7%         63.8%
    Metagenome       96%          71.2%         68.9%
    Real E. coli    100%          51.1%         72.4%
(B) Streaming error trimming for all shotgun data
We can trim reads at the first error.

                    f saturated   error rate   total bases trimmed   errors remaining
    Genome          100%          0.63%        31.90%                0.00%
    Transcriptome    92%          0.65%        34.34%                0.07%
    Metagenome       96%          0.62%        31.70%                0.04%
    Real E. coli    100%          1.59%        12.96%                0.05%
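A minimal sketch of trim-at-first-error, reusing low_count_run() and K from the detection sketch above (again my illustration, not the khmer code):

    def trim_at_first_error(read):
        run = low_count_run(read)
        if run is None:
            return read  # no error detected; keep the whole read
        start, _ = run
        # For a single substitution away from the read ends, the error
        # sits at base start + K - 1, so keeping read[:start + K - 1]
        # drops it while retaining every base validated by a clean k-mer.
        return read[:start + K - 1]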
(C) Streaming error correction
• Once you can do error detection and trimming on a streaming basis, why not error correction?
• …using a new approach…
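Before the actual approach (graph alignment, introduced below), here is a deliberately simple spectral-correction sketch of my own, just to make "correction" concrete; this is not the method the talk describes. It reuses low_count_run() and K from the sketches above.

    BASES = "ACGT"

    def correct_one_error(read):
        # Try each substitution at the located error position and keep
        # the first candidate whose k-mers all look trusted.
        run = low_count_run(read)
        if run is None:
            return read
        start, _ = run
        pos = min(start + K - 1, len(read) - 1)  # single-substitution guess
        for base in BASES:
            candidate = read[:pos] + base + read[pos + 1:]
            if low_count_run(candidate) is None:
                return candidate
        return read  # not fixable with a single substitution here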
Streaming error correction of genomic, transcriptomic, metagenomic data via graph alignment
Jason Pell, Jordan Fish, Michael Crusoe
Pair-HMM-based graph alignment
Jordan Fish and Michael Crusoe
…a bit more complex... 
Jordan Fish and Michael Crusoe
Error correction on simulated E. coli data

                TP            FP           TN             FN
    Streaming   3,494,631     3,865        460,601,171    5,533
                (corrected)   (mistakes)   (OK)           (missed)

1% error rate, 100x coverage.
Michael Crusoe, Jordan Fish, Jason Pell
A few additional thoughts:
• Sequence-to-graph alignment is a very general concept.
• It could replace mapping, variant calling, BLAST, HMMER…
"Ask me for anything but time!"
-- Napoleon Bonaparte
(D) Calculating read error rates by position within read
• Shotgun data is randomly sampled.
• Any variation in mismatches with the reference by position is likely due to errors or bias.
[Pipeline: reads -> assemble -> map reads to assembly -> calculate position-specific mismatches.]
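A hedged sketch of the last pipeline step (my illustration; it assumes gap-free alignments of equal-length reads):

    def error_profile(alignments, read_len):
        # alignments: iterable of (read, ref_segment) pairs from mapping
        # reads back to the assembly; tally mismatches by read position.
        mismatches = [0] * read_len
        n_reads = 0
        for read, ref in alignments:
            n_reads += 1
            for pos, (a, b) in enumerate(zip(read, ref)):
                if a != b:
                    mismatches[pos] += 1
        return [m / n_reads for m in mismatches]  # per-position mismatch rate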
Sequencing run error profiles
Via bowtie mapping against a reference.
Reads from Shakya et al. (PMID 23387867).
We can do this sub-linearly from the data, with no reference!
Reads from Shakya et al. (PMID 23387867).
Reference-free error profile analysis
1. Requires no prior information!
2. Immediate feedback on sequencing quality (for cores & users).
3. Fast, lightweight (~100 MB, ~2 minutes).
4. Works for any shotgun sample (genomic, metagenomic, transcriptomic).
5. Not affected by polymorphisms.
Reference-free error profile analysis
7. …if we know where the errors are, we can trim them.
8. …if we know where the errors are, we can correct them.
9. …if we look at differences by graph position instead of by read position, we can call variants.
=> Streaming, online variant calling?
Future thoughts / streaming 
How far can we take this?
The streaming approach supports more compute-intensive interludes (remapping, etc.).
Rimmer et al., 2014
Streaming online reference-free variant calling: single-pass, reference-free, tunable.
Streaming with reads…
[Diagram: reads stream one at a time into the graph; variants are called from the graph.]
Analysis is done after sequencing.
[Diagram: sequencing completes, then analysis begins.]
Streaming with bases
[Diagram: bases stream into the graph k at a time, extending one base per step; variants are called continuously.]
Integrate sequencing and analysis
[Diagram: sequencing and analysis run together, continually asking "Are we done yet?"]
Directions for streaming graph analysis
• Generate error profiles for shotgun reads;
• Variable coverage error trimming;
• Streaming low-memory error correction for genomes, metagenomes, and transcriptomes;
• Strain variant detection & resolution;
• Streaming variant analysis.
Michael Crusoe, Jordan Fish & Jason Pell
Our software is open source
Methods that aren't broadly available are limited in their utility!
• Everything I talked about is in our github repository, http://github.com/ged-lab/khmer
• …it's not necessarily trivial to use…
• …but we're happy to help.
We have recipes!
Planned work: distributed graph database server
[Architecture diagram: a web interface + API sits over a graph query layer; a compute server (Galaxy? Arvados?) exchanges data/info with it; raw data sets arrive by upload/submit (NCBI, KBase) or import (MG-RAST, SRA, EBI); deployments include public servers, a "walled garden" server, and private servers.]
ivory.idyll.org/blog/2014-moore-ddd-talk.html
Thanks for listening!


Editor's Notes

  1. A sketch showing the relationship between the number of sequence reads and the number of edges in the graph. Because the underlying genome is fixed in size, the number of edges due to the underlying genome plateaus once every part of the genome is covered. Conversely, since errors tend to be random and more or less unique, their number scales linearly with the number of sequence reads. Once enough reads are present to clearly distinguish true edges (which come from the underlying genome), those true edges will usually be outnumbered by spurious edges (which arise from errors) by a substantial factor.
  2. Same sketch as note 1.
  3. Note that any such measure will do.
  4. Goal is to do first-stage data reduction/analysis in less time than it takes to generate the data. Compression => OLC assembly.
  5. Same sketch as note 1.
  6. The point is to enable biology; volume and velocity of data from sequencers is blocking.
  7. Update from Jordan
  8. Analyze data in cloud; import and export important; connect to other databases.