Memory- and time-efficient approaches to sequence analysis with streaming algorithms
C. Titus Brown
ctb@msu.edu
Part I: Digital normalization
Problem: De Bruijn assembly graphs scale with data size, not information.
Conway TC, Bromage AJ. Bioinformatics 2011;27:479-486.
This is the effect of errors: single nucleotide variations cause long branches; they don't rejoin quickly.
Can we change this scaling behavior? 
Conway TC, Bromage AJ. Bioinformatics 2011;27:479-486.
An apparent digression: much of next-gen sequencing is redundant.
Shotgun sequencing and coverage
"Coverage" is simply the average number of reads that overlap each true base in the genome.
Here, the coverage is ~10: just draw a line straight down from the top through all of the reads.
Random sampling => deep sampling needed
Typically 10-100x coverage is needed for robust recovery (300 Gbp for a human genome).
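As a back-of-the-envelope check (my arithmetic, not a slide): expected coverage is C = N x L / G for N reads of length L over a genome of size G, so the 100x target over the ~3 Gbp human genome implies the ~300 Gbp figure above. A quick sketch, with an assumed read length:

    # Quick coverage arithmetic. The ~3 Gbp genome size and 100x target
    # are from the slides; the 100 bp read length is an assumption.
    genome_size = 3e9          # bp, human
    target_coverage = 100      # x
    read_length = 100          # bp, typical Illumina read (assumed)

    total_bases = target_coverage * genome_size
    num_reads = total_bases / read_length
    print(f"{total_bases / 1e9:.0f} Gbp of sequence, "
          f"~{num_reads / 1e9:.1f} billion reads")
    # -> 300 Gbp of sequence, ~3.0 billion reads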
An apparent digression: much of next-gen sequencing is redundant. Can we eliminate this redundancy?
Digital normalization
[Series of figures stepping through the digital normalization process.]
Basic diginorm algorithm
We can build the approach on anything that lets us estimate the coverage of a read.

    for read in dataset:
        if estimated_coverage(read) < CUTOFF:
            update_kmer_counts(read)
            save(read)
        else:
            pass  # discard read

Note: single pass; sublinear memory.
The median k-mer count in a "sentence" is a ~good estimator of coverage.
This gives us a reference-free measure of coverage.
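To make this concrete, here is a minimal runnable sketch of the loop above (mine, not the khmer implementation; khmer keeps memory sublinear with a fixed-size probabilistic counting structure, while the plain dict here is for clarity only):

    from collections import defaultdict
    from statistics import median

    K = 20
    CUTOFF = 20
    kmer_counts = defaultdict(int)

    def kmers(read):
        return [read[i:i + K] for i in range(len(read) - K + 1)]

    def estimated_coverage(read):
        # Median k-mer count: robust to the few low-count k-mers a
        # single sequencing error introduces into a read.
        return median(kmer_counts.get(km, 0) for km in kmers(read))

    def digital_normalization(dataset):
        for read in dataset:
            if len(read) < K:
                continue  # too short to estimate coverage
            if estimated_coverage(read) < CUTOFF:
                for km in kmers(read):
                    kmer_counts[km] += 1
                yield read  # keep ("save") the read
            # else: discard the read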
Digital normalization is streaming
[Series of figures: reads are processed one at a time, in a single pass.]
Digital normalization retains information, while discarding data and errors.
Digital normalization is streaming error correction.
Contig assembly now scales with underlying genome size.
Transcriptomes, microbial genomes (including MDA), and most metagenomes can be assembled in under 50 GB of RAM, with ~identical or improved results.
Victory! (?) 
Conway TC, Bromage AJ. Bioinformatics 2011;27:479-486.
A few “minor” drawbacks… 
1. Repeats are eliminated preferentially. 
2. Genuine graph tips are truncated. 
3. Polyploidy is downsampled. 
4. It’s not clear what happens to polymorphism. 
(For these reasons, we have been pursuing alternate approaches.)
Partially discussed in Brown et al., 2012 (arXiv).
But still quite useful… 
1. Assembling soil metagenomes. 
Howe et al., PNAS, 2014 (w/Tiedje) 
2. Understanding bone-eating worm symbionts. 
Goffredi et al., ISME, 2014. 
3. An ultra-deep look at the lamprey transcriptome. 
Scott et al., in preparation (w/Li) 
4. Understanding development in molgulid ascidians.
Stolfi et al., eLife 2014; etc.
…and widely used (?)
Estimated ~1,000 users of our software.
The diginorm algorithm is now included in the Trinity software from the Broad Institute (~10,000 users).
Illumina TruSeq long-read technology now incorporates our approach (~100,000 users).
Part II: Wait, did you say streaming?
Diginorm can detect graph saturation
Graph saturation

    for read in dataset:
        if estimated_coverage(read) < CUTOFF:
            update_kmer_counts(read)
            save(read)
        else:
            pass  # high-coverage read: do something clever!
"Few-pass" approach
By 20% of the way through a 100x data set, more than half the reads are saturated to 20x.
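One hedged sketch of what detecting saturation could look like in code (my illustration, reusing kmers/estimated_coverage/kmer_counts/CUTOFF from the diginorm sketch above; WINDOW and THRESHOLD are assumed parameters, not from the talk):

    # Track the fraction of recent reads that are already high coverage;
    # once most incoming reads are saturated, downstream logic can switch
    # to the "do something clever" branch.
    from collections import deque

    WINDOW = 10000    # recent reads to consider (assumed)
    THRESHOLD = 0.5   # "saturated" when >50% of recent reads are high coverage

    recent = deque(maxlen=WINDOW)
    kept = []

    def process(read):
        high = estimated_coverage(read) >= CUTOFF
        recent.append(high)
        if not high:
            for km in kmers(read):
                kmer_counts[km] += 1
            kept.append(read)   # save(read)
        return sum(recent) / len(recent) >= THRESHOLD  # True once saturated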
(A) Streaming error detection for metagenomes and transcriptomes
• Illumina has an error rate between 0.1% and 1%.
• These errors confound mapping, assembly, etc.
(Think: what if you had error-free reads? Life would be much better.)
Spectral error detection for genomes (Chaisson et al., 2009)
[Figure: k-mer abundance spectrum; high-abundance true k-mers separate from low-abundance erroneous k-mers.]
Spectral error detection on reads: error location!
[Figure: the run of low-count k-mers within a read pinpoints where the error is.]
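A hedged sketch of the idea (mine, not the khmer code; it reuses K and kmer_counts from the diginorm sketch above, and LOW is an assumed cutoff): an error at base b corrupts every k-mer covering b, so the run of low-count k-mers brackets the error.

    LOW = 3  # k-mer counts below this are treated as erroneous (assumed)

    def low_count_run(read):
        """Return (start, end) of the first run of low-count k-mers, or None."""
        flags = [kmer_counts.get(read[i:i + K], 0) < LOW
                 for i in range(len(read) - K + 1)]
        if True not in flags:
            return None
        start = flags.index(True)
        end = start
        while end + 1 < len(flags) and flags[end + 1]:
            end += 1
        return start, end

    def locate_error(read):
        run = low_count_run(read)
        if run is None:
            return None  # read looks error-free
        start, end = run
        # Every low k-mer must cover the erroneous base, so it lies in
        # read positions [end, start + K - 1]; for a single error away
        # from the read ends, this interval collapses to one base.
        return end, start + K - 1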
…spectral error detection for reads extends to transcriptomes and metagenomes.
[Figure: true vs. erroneous k-mers under variable coverage (after Chaisson et al., 2009).]
Spectral error detection on variable coverage data
How many of the errors can we pinpoint exactly?

                    f saturated   Specificity   Sensitivity
    Genome          100%          71.4%         77.9%
    Transcriptome    92%          67.7%         63.8%
    Metagenome       96%          71.2%         68.9%
    Real E. coli    100%          51.1%         72.4%
(B) Streaming error trimming for all shotgun data
We can trim reads at the first error.

                    f saturated   error rate   total bases trimmed   errors remaining
    Genome          100%          0.63%        31.90%                0.00%
    Transcriptome    92%          0.65%        34.34%                0.07%
    Metagenome       96%          0.62%        31.70%                0.04%
    Real E. coli    100%          1.59%        12.96%                0.05%
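A minimal sketch of trim-at-first-error, reusing low_count_run() and K from the detection sketch above (again my illustration, not the khmer code):

    def trim_at_first_error(read):
        run = low_count_run(read)
        if run is None:
            return read  # no error detected; keep the whole read
        start, _ = run
        # For a single substitution away from the read ends, the error
        # sits at base start + K - 1, so keeping read[:start + K - 1]
        # drops it while retaining every base validated by a clean k-mer.
        return read[:start + K - 1]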
(C) Streaming error correction
• Once you can do error detection and trimming on a streaming basis, why not error correction?
• …using a new approach…
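Before the actual approach (graph alignment, introduced below), here is a deliberately simple spectral-correction sketch of my own, just to make "correction" concrete; this is not the method the talk describes. It reuses low_count_run() and K from the sketches above.

    BASES = "ACGT"

    def correct_one_error(read):
        # Try each substitution at the located error position and keep
        # the first candidate whose k-mers all look trusted.
        run = low_count_run(read)
        if run is None:
            return read
        start, _ = run
        pos = min(start + K - 1, len(read) - 1)  # single-substitution guess
        for base in BASES:
            candidate = read[:pos] + base + read[pos + 1:]
            if low_count_run(candidate) is None:
                return candidate
        return read  # not fixable with a single substitution here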
Streaming error correction of genomic, transcriptomic, metagenomic data via graph alignment
Jason Pell, Jordan Fish, Michael Crusoe
Pair-HMM-based graph alignment
Jordan Fish and Michael Crusoe
…a bit more complex... 
Jordan Fish and Michael Crusoe
Error correction on simulated E. coli data

                TP            FP           TN             FN
    Streaming   3,494,631     3,865        460,601,171    5,533
                (corrected)   (mistakes)   (OK)           (missed)

1% error rate, 100x coverage.
Michael Crusoe, Jordan Fish, Jason Pell
A few additional thoughts:
• Sequence-to-graph alignment is a very general concept.
• It could replace mapping, variant calling, BLAST, HMMER…
"Ask me for anything but time!"
-- Napoleon Bonaparte
(D) Calculating read error rates by position within read
• Shotgun data is randomly sampled.
• Any variation in mismatches with the reference by position is likely due to errors or bias.
[Pipeline: reads -> assemble -> map reads to assembly -> calculate position-specific mismatches.]
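A hedged sketch of the last pipeline step (my illustration; it assumes gap-free alignments of equal-length reads):

    def error_profile(alignments, read_len):
        # alignments: iterable of (read, ref_segment) pairs from mapping
        # reads back to the assembly; tally mismatches by read position.
        mismatches = [0] * read_len
        n_reads = 0
        for read, ref in alignments:
            n_reads += 1
            for pos, (a, b) in enumerate(zip(read, ref)):
                if a != b:
                    mismatches[pos] += 1
        return [m / n_reads for m in mismatches]  # per-position mismatch rate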
Sequencing run error profiles
Via bowtie mapping against a reference.
Reads from Shakya et al. (PMID 23387867).
We can do this sub-linearly from the data, with no reference!
Reads from Shakya et al. (PMID 23387867).
Reference-free error profile analysis
1. Requires no prior information!
2. Immediate feedback on sequencing quality (for cores & users).
3. Fast, lightweight (~100 MB, ~2 minutes).
4. Works for any shotgun sample (genomic, metagenomic, transcriptomic).
5. Not affected by polymorphisms.
Reference-free error profile analysis
7. …if we know where the errors are, we can trim them.
8. …if we know where the errors are, we can correct them.
9. …if we look at differences by graph position instead of by read position, we can call variants.
=> Streaming, online variant calling?
Future thoughts / streaming 
How far can we take this?
The streaming approach supports more compute-intensive interludes (remapping, etc.).
Rimmer et al., 2014
Streaming online reference-free variant calling: single-pass, reference-free, tunable.
Streaming with reads…
[Diagram: reads stream one at a time into the graph; variants are called from the graph.]
Analysis is done after sequencing.
[Diagram: sequencing completes, then analysis begins.]
Streaming with bases
[Diagram: bases stream into the graph k at a time, extending one base per step; variants are called continuously.]
Integrate sequencing and analysis
[Diagram: sequencing and analysis run together, continually asking "Are we done yet?"]
Directions for streaming graph analysis
• Generate error profiles for shotgun reads;
• Variable coverage error trimming;
• Streaming low-memory error correction for genomes, metagenomes, and transcriptomes;
• Strain variant detection & resolution;
• Streaming variant analysis.
Michael Crusoe, Jordan Fish & Jason Pell
Our software is open source
Methods that aren't broadly available are limited in their utility!
• Everything I talked about is in our github repository, http://github.com/ged-lab/khmer
• …it's not necessarily trivial to use…
• …but we're happy to help.
We have recipes!
Planned work: distributed graph database server
[Architecture diagram: a web interface + API sits over a graph query layer; a compute server (Galaxy? Arvados?) exchanges data/info with it; raw data sets arrive by upload/submit (NCBI, KBase) or import (MG-RAST, SRA, EBI); deployments include public servers, a "walled garden" server, and private servers.]
ivory.idyll.org/blog/2014-moore-ddd-talk.html
Thanks for listening!


Editor's Notes

  1. A sketch showing the relationship between the number of sequence reads and the number of edges in the graph. Because the underlying genome is fixed in size, the number of edges due to the underlying genome plateaus once every part of the genome is covered. Conversely, since errors tend to be random and more or less unique, their number scales linearly with the number of sequence reads. Once enough reads are present to clearly distinguish true edges (which come from the underlying genome), those true edges will usually be outnumbered by spurious edges (which arise from errors) by a substantial factor.
  2. Same sketch as note 1.
  3. Note that any such measure will do.
  4. Goal is to do first-stage data reduction/analysis in less time than it takes to generate the data. Compression => OLC assembly.
  5. Same sketch as note 1.
  6. The point is to enable biology; volume and velocity of data from sequencers is blocking.
  7. Update from Jordan
  8. Analyze data in cloud; import and export important; connect to other databases.