2014 moore-ddd

Infrastructure for Data
Intensive Biology
“Better Science through Superior Software”
C. Titus Brown

Current research:
Compressive algorithms for
sequence analysis
[Diagram: Raw data (~10-100 GB); Compression (~2 GB); Analysis; "Information" (~1 GB, several instances); Database & integration]
Can we enable and accelerate sequence-based inquiry by making all basic
analysis easier and some analyses possible?

Three super-awesome
technologies…
1. Low-memory k-mer counting
(Zhang et al., PLoS One, 2014)
2. Compressible assembly graphs
(Pell et al., PNAS, 2012)
3. Streaming lossy compression of sequence
data
(Brown et al., arXiv, 2012)
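To make items 1 and 3 concrete, here is a minimal sketch in plain Python. It is not khmer's API or implementation: the class `CountMinKmerCounter`, the function `normalize`, and all parameter values are hypothetical names chosen for illustration. It assumes a Count-Min Sketch for fixed-memory approximate k-mer counting and a median-k-mer-count rule for streaming lossy compression (keep a read only while its median k-mer count is below a cutoff). Reverse complements and sequencing errors are ignored.

```python
import hashlib
from statistics import median


class CountMinKmerCounter:
    """Approximate k-mer counter with fixed memory (Count-Min Sketch)."""

    def __init__(self, ksize, table_size=10007, n_tables=4):
        self.ksize = ksize
        self.table_size = table_size
        # Memory use is n_tables * table_size counters, independent of input size.
        self.tables = [[0] * table_size for _ in range(n_tables)]

    def _slots(self, kmer):
        # One hash per table, derived by salting the hash with the table index.
        for i in range(len(self.tables)):
            digest = hashlib.blake2b(
                kmer.encode(), digest_size=8, salt=bytes([i])
            ).digest()
            yield i, int.from_bytes(digest, "big") % self.table_size

    def kmers(self, seq):
        return [seq[i:i + self.ksize] for i in range(len(seq) - self.ksize + 1)]

    def add(self, seq):
        """Count every k-mer in a sequence."""
        for kmer in self.kmers(seq):
            for i, slot in self._slots(kmer):
                self.tables[i][slot] += 1

    def get(self, kmer):
        # Collisions can only inflate a cell, so the minimum across tables
        # is the tightest estimate; it never undercounts.
        return min(self.tables[i][slot] for i, slot in self._slots(kmer))

    def median_count(self, seq):
        return median(self.get(kmer) for kmer in self.kmers(seq))


def normalize(reads, ksize=5, cutoff=3):
    """Single pass over reads: drop reads whose k-mers are already well covered."""
    counter = CountMinKmerCounter(ksize)
    kept = []
    for read in reads:
        if len(read) < ksize:
            continue  # too short to contain any k-mer
        if counter.median_count(read) < cutoff:
            counter.add(read)  # only retained reads contribute to the counts
            kept.append(read)
    return kept


if __name__ == "__main__":
    reads = ["ACGTACGTAC"] * 10 + ["TTTTGGGGCC"]
    print(normalize(reads, ksize=5, cutoff=3))
```

With this toy input, most copies of the repeated read are discarded once its k-mers reach the cutoff, while the unique read is retained. The trade-off is that counts may be overestimated when k-mers collide in every table, which is why the table size and number of tables matter.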

…implemented in one super-awesome
software package…
github.com/ged-lab/khmer/
BSD licensed
Openly developed using good practice.
> 10 external contributors.
Thousands of downloads/month.
50 citations in 3 years.
We think > 1000 people are using it; have heard
from dozens.

…enabling super-awesome
biology.
1. Assembling soil metagenomes
Howe et al., PNAS, 2014
2. Understanding bone-eating worm symbionts
Goffredi et al., ISME, 2014.
3. An ultra-deep look at the lamprey transcriptome
(in preparation)
4. Understanding derived anural development in
Molgulid ascidians (in preparation)

Early on, lack of replicability in pubs slowed us down =>
Strategy: “level up” the field
High quality & novel science,
done openly,
written up in reproducible and
remixable papers,
using IPython Notebook,
and posted to preprint servers.
Expression-based clustering of 85 lamprey tissue samples
(de novo assembly of 3 billion reads): ~1 month
Camille Scott

Open protocols for the
cloud: ~$100/analysis
Read cleaning
Preprocessing
Assembly
Annotation
khmer-protocols.readthedocs.org/
Transcriptome and metagenome assembly protocols

The data challenge in biology
In 5-10 years, we will have nigh-infinite data.
(Genomic, transcriptomic, proteomic,
metabolomic, …?)
We currently have no good way of querying,
exploring, investigating, or mining these data
sets, especially across multiple locations.
Moreover, most data is unavailable until after
publication…
…which, in practice, means it will be lost.

Proposal: distributed graph database server
[Diagram components:]
Compute server (Galaxy? Arvados?)
Web interface + API
Data/Info
Raw data sets
Graph query layer
Public servers; "walled garden" server; private server
Upload/submit (NCBI, KBase)
Import (MG-RAST, SRA, EBI)

Graph queries
across public & walled-garden data sets:
[Diagram: nodes "raw sequence", "assembled sequence", "nitrite reductase", "ppaZ"; edges labeled SIMILARITY TO and ALSO CONTAINS]
See Lee, Alekseyenko, and Brown, paper in SciPy 2009: the “pygr” project.

The larger vision
Enable and incentivize sharing by providing
immediate utility; frictionless sharing.
Permissionless innovation for e.g. new data
mining approaches.
Plan for poverty with federated infrastructure
built on open & cloud.
Solve people’s current problems, while
remaining agile for the future.

Who needs this?
Everyone.
Environmental microbiology, evo devo,
agriculture, VetMed...

How would I start?
1-2 pilot projects w/domain postdocs: drive computational infrastructure with biology problems.
Support postdocs with software engineer (infrastructure) and graduate student CS (research).
Cross-train postdocs in data-intensive research methods and software engineering.
Note: finding existing data is not a
problem :)
“DeepDOM” cruise: examination
of dissolved organic matter &
microbial metabolism vs
physical parameters – potential
collab.
Via Elizabeth Kujawinski

Education and training
Biology is underprepared for data-intensive
investigation.
We must teach and train the next generations.
~5-10 workshops / year, novice -> masterclass; open
materials.
Deeply self-interested:
What problems does everyone have, now?
(Assembly)
What problems do leading-edge researchers have?
(Data integration)

Pre-answered Questions
Q: What will be open?
A: Everything; I succeed & fail publicly.
Q: How will you measure success?
A: By other people using & extending our
“products” without talking to us.
Blog: ivory.idyll.org/blog/ - search for “moore”, “satire”
@ctitusbrown

Graph queries
across public & walled-garden data
sets:
“What data sets contain <this gene>?”
“Which reads match to <this gene>, but
not in <conserved domain>?”
“Give me relative abundance of <gene
X> across all data sets, grouped by
nitrogen exposure.”
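As a rough illustration of what the first and third example queries might look like against a single in-memory index, here is a toy sketch. Everything in it is an assumption for illustration: the `DATASETS` records, their field names, and the read counts are made up, and the functions `datasets_containing` and `relative_abundance_by` are hypothetical, not part of any existing server or of pygr. A real system would answer these from a graph query layer spanning public and walled-garden servers.

```python
from collections import defaultdict

# Made-up toy records: per-data-set metadata plus read counts per gene.
DATASETS = {
    "soil_A": {
        "nitrogen": "high",
        "total_reads": 1000,
        "gene_hits": {"nitrite_reductase": 50, "ppaZ": 10},
    },
    "soil_B": {
        "nitrogen": "high",
        "total_reads": 2000,
        "gene_hits": {"nitrite_reductase": 20},
    },
    "marine_C": {
        "nitrogen": "low",
        "total_reads": 1000,
        "gene_hits": {"ppaZ": 5},
    },
}


def datasets_containing(gene):
    """'What data sets contain <this gene>?'"""
    return sorted(
        name
        for name, record in DATASETS.items()
        if record["gene_hits"].get(gene, 0) > 0
    )


def relative_abundance_by(gene, metadata_key):
    """'Relative abundance of <gene X> across all data sets, grouped by <key>.'

    Relative abundance here is simply hits / total reads, averaged per group.
    """
    grouped = defaultdict(list)
    for record in DATASETS.values():
        fraction = record["gene_hits"].get(gene, 0) / record["total_reads"]
        grouped[record[metadata_key]].append(fraction)
    return {group: sum(vals) / len(vals) for group, vals in grouped.items()}


if __name__ == "__main__":
    print(datasets_containing("nitrite_reductase"))
    print(relative_abundance_by("nitrite_reductase", "nitrogen"))
```

The second example query (reads matching a gene but not a conserved domain) needs per-read alignment coordinates, which this flat toy index deliberately omits.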
