A data intensive future:
How can biology take full advantage of
the coming data deluge?
C. Titus Brown
School of Veterinary Medicine;
Genome Center & Data Science Initiative
11/13/15
Outline
0. Background
1. Research: what do we do with infinite data?
2. Development: software and infrastructure.
3. Open science & reproducibility.
4. Training
0. Background
In which I present the perspective that we face
increasingly large data sets, from diverse
samples, generated in real time, with many
different data types.
DNA sequencing rates continue to grow.
Stephens et al., 2015 - 10.1371/journal.pbio.1002195
Oxford Nanopore sequencing
Slide via Torsten Seemann
Nanopore technology
Slide via Torsten Seemann
Scaling up --
Scaling up --
Slide via Torsten Seemann
http://ebola.nextflu.org/
“Fighting Ebola With a Palm-Sized DNA Sequencer”
See: http://www.theatlantic.com/science/archive/2015/09/ebola-sequencer-dna-minion/405466/
“DeepDOM” cruise: examination
of dissolved organic matter &
microbial metabolism vs physical
parameters – potential collab.
Via Elizabeth Kujawinski
1. Research
In which I discuss advances made towards
analyzing infinite amounts of genomic data, and
the perspectives engendered thereby: to wit,
streaming and sketches.
Conway TC, Bromage AJ. Bioinformatics 2011;27:479-486. © The Author 2011, Oxford University Press.
De Bruijn graphs (sequencing graphs) scale with
data size, not information size.
Why do sequence graphs scale
badly?
Memory usage ~ “real” variation + number of errors
Number of errors ~ size of data set
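To see why error k-mers come to dominate, here is a toy simulation in Python (made-up genome, read length, and error rate; this is not the Velvet measurement on the next slide). True k-mers plateau at roughly the genome size, while k-mers created by random sequencing errors keep growing with the number of reads:

    import random

    def simulate_kmer_growth(genome_len=10_000, read_len=100, n_reads=5_000,
                             k=21, error_rate=0.01):
        """Toy model: count distinct true vs. error k-mers in simulated reads."""
        genome = "".join(random.choice("ACGT") for _ in range(genome_len))
        true_kmers = {genome[i:i + k] for i in range(genome_len - k + 1)}
        observed = set()
        for _ in range(n_reads):
            start = random.randrange(genome_len - read_len + 1)
            read = list(genome[start:start + read_len])
            for i in range(read_len):
                if random.random() < error_rate:     # random sequencing error
                    read[i] = random.choice("ACGT")
            read = "".join(read)
            observed.update(read[i:i + k] for i in range(read_len - k + 1))
        # True k-mers are bounded by the genome; error k-mers grow with the data.
        return len(observed & true_kmers), len(observed - true_kmers)

    true_count, error_count = simulate_kmer_growth()
    print(f"true k-mers: {true_count}, error k-mers: {error_count}")

Doubling n_reads roughly doubles the error k-mer count while the true k-mer count stays put, which is the memory problem in a nutshell.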
Practical memory measurements
Velvet measurements (Adina Howe)
Our solution: lossy compression
Conway TC, Bromage AJ. Bioinformatics 2011;27:479-486. © The Author 2011, Oxford University Press.
Shotgun sequencing and coverage
“Coverage” is simply the average number of reads that overlap
each true base in the genome.
Here, the coverage is ~10 – just draw a line straight down from the top
through all of the reads.
Random sampling => deep sampling needed
Typically 10-100x needed for robust recovery (30-300 Gbp for human)
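As a back-of-the-envelope check of those numbers (illustrative values only): coverage is just total bases sequenced divided by genome size, so ~100x coverage of a ~3 Gbp human genome takes on the order of 300 Gbp of reads.

    def coverage(n_reads, read_len, genome_size):
        """Average coverage = total bases sequenced / genome size."""
        return n_reads * read_len / genome_size

    # Two billion 150 bp reads (~300 Gbp of sequence) over a ~3 Gbp genome:
    print(coverage(n_reads=2_000_000_000, read_len=150, genome_size=3_000_000_000))
    # -> 100.0 (i.e. ~100x)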
Digital normalization
Graph size now scales with information content.
Most samples can be reconstructed via de
novo assembly on commodity computers.
Diginorm ~ “lossy compression”
Nearly perfect from an information theoretic
perspective:
– Discards 95% or more of the data for genomes.
– Loses < 0.02% of information.
This changes the way analyses
scale.
Conway TC, Bromage AJ. Bioinformatics 2011;27:479-486. © The Author 2011, Oxford University Press.
Streaming lossy compression:

    for read in dataset:
        if estimated_coverage(read) < CUTOFF:
            yield read

This is literally a three line algorithm. Not kidding.
It took four years to figure out which three lines, though…
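A minimal, self-contained sketch of that idea in plain Python. The helper names here are made up for illustration, and khmer's real implementation estimates coverage with a fixed-memory probabilistic counting structure rather than an exact dictionary:

    from collections import Counter

    def median_kmer_coverage(read, kmer_counts, k):
        """Estimate coverage as the median count of this read's k-mers so far."""
        counts = sorted(kmer_counts[read[i:i + k]]
                        for i in range(len(read) - k + 1))
        return counts[len(counts) // 2]

    def digital_normalization(reads, cutoff=20, k=21):
        """Keep a read only if its estimated coverage is still below the cutoff."""
        kmer_counts = Counter()       # exact counts; khmer uses a sketch instead
        for read in reads:
            if len(read) < k:
                continue              # too short to estimate coverage
            if median_kmer_coverage(read, kmer_counts, k) < cutoff:
                for i in range(len(read) - k + 1):
                    kmer_counts[read[i:i + k]] += 1
                yield read

    # Usage (my_reads is any iterable of sequence strings):
    # kept_reads = list(digital_normalization(my_reads))

Reads from already well-covered regions add nothing to the k-mer table and are discarded, which is what keeps the downstream graph size tied to information content rather than data volume.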
Diginorm can detect information
saturation in a stream.
Zhang et al., submitted.
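One way to picture the saturation test (a guess at the shape of the idea, hedged: this is not the actual criterion from Zhang et al.): watch the fraction of reads that diginorm keeps in each window of the stream, and call the sample saturated once that fraction collapses.

    def saturation_point(keep_flags, window=10_000, threshold=0.05):
        """Given a stream of booleans (did diginorm keep this read?), return
        the read count at which the per-window keep rate first drops below
        `threshold`, or None if the stream never saturates."""
        kept = 0
        for n, was_kept in enumerate(keep_flags, 1):
            kept += was_kept
            if n % window == 0:
                if kept / window < threshold:
                    return n
                kept = 0
        return None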
This generically permits semi-streaming
analytical approaches.
Zhang et al., submitted.
e.g. E. coli analysis => ~1.2 pass, sublinear
memory
Zhang et al., submitted.
Another simple algorithm.
Zhang et al., submitted.
Single pass, reference free, tunable, streaming online
variant calling.
Error detection => variant calling
Real time / streaming data analysis.
(Pipeline diagram: raw data, possibly in real time from the sequencer => error trimming => variant calling => de novo assembly.)
My real point -
• We need well-founded, flexible, algorithmically efficient, high-performance components for sequence data manipulation in biology.
• We are building some of these on a streaming, low-memory paradigm.
• We are building out a scripting library for composing these operations (a minimal sketch of that composition style follows).
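A minimal sketch of that composition style. This is not the khmer scripting API; the function names, the toy FASTA parser, and the file name reads.fa are invented for illustration. Each stage is a generator that consumes reads lazily and yields them onward, so the whole pipeline stays single-pass and low-memory:

    def read_fasta(path):
        """Yield sequences one at a time from a FASTA file (toy parser)."""
        seq = []
        with open(path) as fh:
            for line in fh:
                if line.startswith(">"):
                    if seq:
                        yield "".join(seq)
                    seq = []
                else:
                    seq.append(line.strip())
        if seq:
            yield "".join(seq)

    def error_trim(reads, min_len=50):
        """Stand-in for streaming error trimming: strip flanking Ns, drop short reads."""
        for read in reads:
            read = read.strip("N")
            if len(read) >= min_len:
                yield read

    # Compose stages; nothing is read from disk until the consumer iterates.
    # pipeline = error_trim(read_fasta("reads.fa"))
    # for read in pipeline:
    #     ...  # hand off to diginorm, variant calling, assembly, etc.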
2. Software and infrastructure
Alas, practical data analysis depends on
software and computers, which leads to
depressingly practical considerations for
gentleperson scientists.
Software
It’s all well and good to develop new data
analysis approaches, but their utility is greater
when they are implemented in usable software.
Writing, maintaining, and progressing research
software is hard.
The khmer software package
• Demo implementation of research data structures &
algorithms;
• 10.5k lines of C++ code, 13.7k lines of Python code;
• khmer v2.0 has 87% statement coverage under test;
• ~3-4 developers, 50+ contributors, ~1000s of users (?)
The khmer software package, Crusoe et al., 2015. http://f1000research.com/articles/4-900/v1
khmer is developed as a true open
source package
• github.com/dib-lab/khmer;
• BSD license;
• Code review, two-person sign off on changes;
• Continuous integration (tests are run on each
change request);
Challenges:
Research vs stability!
Stable software for users, & platform for future
research;
vs research “culture”
(funding and careers)
How is continued software dev feasible?!
Representative half-arsed lab software development
(Diagram: a version that worked once, for some publication; grad student 1 and grad student 2 each build their research on it; the result is incompatible and broken code.)
A not-insane way to do software development
(Diagram: a stable version underpins grad student 1's and grad student 2's research; tests are run on every change, so work flows back into stable, tested code.)
Infrastructure issues
Suppose that we have a nice ecosystem of bioinformatics &
data analysis tools.
Where and how do we run them?
Consider:
1. Biologists hate funding computational infrastructure.
2. Researchers are generally incompetent at building and
maintaining usable infrastructure.
3. Centralized infrastructure fails in the face of infinite data.
Decentralized infrastructure for bioinformatics?
(Architecture diagram: compute servers (Galaxy? Arvados?) expose a web interface + API over data/info storage, raw data sets, and a graph query layer; public servers, "walled garden" servers, and private servers interoperate; upload/submit to NCBI and KBase; import from MG-RAST, SRA, and EBI.)
ivory.idyll.org/blog/2014-moore-ddd-award.html
3. Open science and
reproducibility
In which I start from the point that most
researchers* cannot replicate their own
computational analyses, much less reproduce
those published by anyone else.
*This doesn’t apply to anyone in this
audience; you’re all outliers!
My lab & the diginorm paper.
• All our code was on github;
• Much of our data analysis was in the cloud (on
Amazon EC2);
• Our figures were made in IPython Notebook.
• Our paper was in LaTeX.
Brown et al., 2012 (arXiv)
IPython Notebook: data + code =>
(Figure: screenshot of an IPython Notebook.)
To reproduce our paper:
git clone <khmer> && python setup.py install
git clone <pipeline>
cd pipeline
wget <data> && tar xzf <data>
make && cd ../notebook && make
cd ../ && make
This is the standard process in the lab --
Our papers now have:
• Source hosted on github;
• Data hosted there or on AWS;
• Long-running data analysis => ‘make’
• Graphing and data digestion => IPython Notebook (also in github)
Zhang et al. doi: 10.1371/journal.pone.0101271
Research process
(Cycle diagram: generate new results and encode them in a Makefile => summarize in an IPython Notebook => push to github => discuss and explore => repeat.)
Literate graphing & interactive
exploration
Camille Scott
Why bother??
“There is no scientific knowledge of the individual.”
(Aristotle)
More pragmatically, we are tired of struggling to
reproduce other people’s results.
And, in the end, it’s not all that much extra work.
What does this have to do with
open science?
This is a longer & larger conversation, but:
All of our processes enable easy and efficient pre-publication
sharing. Source code, analyses, preprints…
When we share early, our ideas have a significant competitive
advantage in the research marketplace of ideas.
4. Training
In which I note that methods and tools do little
without a trained hand wielding them, and a
trained eye examining the results.
Perspectives on training
• Prediction: The single biggest challenge
facing biology over the next 20 years is the
lack of data analysis training (see: NIH DIWG
report)
• Data analysis is not turning the crank; it is an
intellectual exercise on par with
experimental design or paper writing.
• Training is systematically undervalued in
academia (!?)
UC Davis and training
My goal here is to support the coalescence and
growth of a local community of practice around
“data intensive biology”.
Summer NGS workshop (2010-2017)
General parameters:
• Regular intensive workshops, half-day or longer.
• Aimed at research practitioners (grad students & more
senior); open to all (including outside community).
• Novice (“zero entry”) on up.
• Low cost for students.
• Leverage global training initiatives.
Thus far & near future
~12 workshops on bioinformatics in 2015.
Trying out soon:
• Half-day intro workshops;
• Week-long advanced workshops;
• Co-working hours.
dib-training.readthedocs.org/
The End.
• If you think 5-10 years out, we face significant practical
issues for data analysis in biology.
• We need new algorithms/data structures, AND good
implementations, AND better computational practice,
AND training.
• This can be either viewed with despair… or seen as an
opportunity to seize the competitive advantage!
(How I view it varies from day to day.)
Thanks for listening!
Please contact me at ctbrown@ucdavis.edu!
Note: I work here now!


Editor's Notes

  1. A sketch showing the relationship between the number of sequence reads and the number of edges in the graph. Because the underlying genome is fixed in size, the number of edges that come from the underlying genome plateaus once every part of the genome is covered, no matter how many more reads are added. Conversely, since errors tend to be random and more or less unique, the number of edges they create scales linearly with the number of sequence reads. So even once there is enough coverage to clearly distinguish true edges (from the underlying genome), they will usually be outnumbered by spurious edges (from errors) by a substantial factor.
  2. Goal is to do first stage data reduction/analysis in less time than it takes to generate the data. Compression => OLC assembly.
  3. Taking advantage of structure within read
  4. Analyze data in cloud; import and export important; connect to other databases.
  5. Lure them in with bioinformatics and then show them that Michigan, in the summertime, is quite nice!