Can we get scientists
to share data through
self-interest?
C. Titus Brown
UC Davis
ctbrown@ucdavis.edu
Thanks, Nick!
This is an attempt to explain why I pitched this:
http://ivory.idyll.org/blog/2014-moore-ddd-talk.html
and talk about what I’d like to do with the money.
The way data commonly gets published:
Gather data → Analyze data → Write paper → Publish paper and data
Many failure modes:
Gather data → Analyze data → Write paper → Publish paper and data ✗
Lack of expertise; lack of tools; lack of compute; bad experimental design.
Many failure modes:
Gather data → Analyze data → Write paper → Publish paper and data ✗
(The usual reasons)
One failure mode in particular:
Gather data → Analyze data → Write paper → Publish paper and data
Other data ✗
One failure mode in particular:
Gather data → Analyze data → Write paper → Publish paper and data
Other data ✗
Lots of biological data doesn’t make sense, except in the light of other data.
This is especially true in two of the fields I work in, environmental metagenomics and non-model mRNAseq.
(For example: gene annotation by homology)
[Diagram: sequences classified as Cephalopod, Mollusc, anything else, or no similarity.]
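Not from the slides: a minimal sketch of what “annotation by homology” looks like in practice, here using Biopython’s NCBI BLAST interface. The transcript sequence is made up, and a real pipeline would usually run a local BLAST or another aligner.

# Minimal sketch of homology-based annotation via Biopython's NCBI BLAST
# interface. The transcript sequence below is invented for illustration.
from Bio.Blast import NCBIWWW, NCBIXML

query = ">transcript_1\nATGGCTAGCTAGCTACGATCGATCGATCGATCGGCTAGCTAG"
handle = NCBIWWW.qblast("blastx", "nr", query)   # translated search against nr
record = NCBIXML.read(handle)

# The best we can say about an unknown transcript is often "similar to X",
# which only works if X has been deposited somewhere public.
for alignment in record.alignments[:5]:
    hsp = alignment.hsps[0]
    print(alignment.title, hsp.expect)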
Hmm.
[Diagram contrasting data publication with data analysis.]
I believe:
There are many interesting and useful data sets immured behind lab walls by lack of:
• Expertise
• Tools
• Compute
• Well-designed experimental setup
• Pre-analysis data publication culture in biology
• Recognition that sometimes hypotheses just get in the way
• Good editorial judgment
(Side note)
The existence of journals that will let you publish virtually anything should have really helped data availability!
Sadly, many of them don’t enforce data publication rules.
Data publications!
The obvious solution: data pubs!
(“Pre-publication data sharing”)
Make your data available so that others can cite it!
GigaScience, Data Science, etc.
…but we don’t yet reward this culturally in biology.
(True story: no one cares, yet.)
I’m actually uncertain myself about how much we should reward data and source code pubs. But we can talk later.
Pre-publication data sharing?
There is no obvious reason to make data available prior to publication of its analysis.
There is no immediate reward for doing so.
Neither is there much systematized reward for doing so.
(Citations and kudos feel good, but are cold comfort.)
Worse, there are good reasons not to do so.
If you make your data available, others can take advantage of it…
…but they don’t have to share their data with you in order to do so.
This bears some similarity to the Prisoner’s Dilemma:
http://www.acting-man.com/?p=34313
“Confessing” (defecting) here corresponds to not sharing your data.
Note: I’m not a game theorist (but some of my best friends are).
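To make the analogy concrete, here is a toy payoff matrix (not from the talk; the numbers are invented purely for illustration):

# Toy Prisoner's Dilemma payoffs for data sharing. The numbers are made up;
# they only encode the usual ordering: free-riding > mutual sharing >
# mutual withholding > being the only lab that shares.
PAYOFF = {
    ("share",    "share"):    (3, 3),  # both labs benefit from each other's data
    ("share",    "withhold"): (0, 5),  # you help a competitor who shares nothing back
    ("withhold", "share"):    (5, 0),  # you free-ride on their data
    ("withhold", "withhold"): (1, 1),  # everyone analyzes in isolation
}

def best_response(their_choice):
    """Row player's payoff-maximizing strategy against a fixed opponent."""
    return max(("share", "withhold"),
               key=lambda mine: PAYOFF[(mine, their_choice)][0])

# Withholding dominates whatever the other lab does; hence the dilemma.
print(best_response("share"), best_response("withhold"))   # withhold withhold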
So, how do we get academics
to share their data!?
Two successful “systems” (send me more!!)
1. Oceanographic research
2. Biomedical research
1. Research cruises are expensive!
In oceanography, individual researchers cannot afford to set up a cruise.
So, they form scientific consortia.
These consortia have data sharing and preprint sharing agreements.
(I’m told it works pretty well (?))
2. Some data makes more sense when you have more data
Omberg et al., Nature Genetics, 2013.
Sage Bionetworks et al.:
Organize a consortium to generate data;
Standardize data generation;
Share via common platform;
Store results, provenance, analysis descriptions, and source code;
Run a leaderboard for a subset of analyses;
Win!
This “walled garden”
model is interesting!
“Compete” on analysis, not on data.
Some notes -
• Sage model requires ~similar data in common format;
• Common analysis platform then becomes immediately useful;
• Data is ~easily re-usable by participants;
• Publication of data becomes straightforward;
• Both models are centralized and coordinated.
The $1.5m question(s):
• Can we “port” this sharing model over to environmental metagenomics, non-model mRNAseq, and maybe even VetMed and agricultural research?
• Can we use this model to drive useful pre-publication data sharing?
• Can we take it from a coordinated and centralized model to a decentralized model?
A slight digression -
Most data analysis models are based on centralizing data and then computing on it there. This has several failure points:
• Political: expect lots of biomedical and environmental data to be restricted geopolitically.
• Computation: in the limit of infinite data…
• Bandwidth: in the limit of infinite data…
• Funding: in the limit of infinite data…
Proposal: distributed graph database server
[Architecture diagram: a graph query layer and a web interface + API sit in front of a compute server (Galaxy? Arvados?) and data/info storage holding raw data sets; data arrives by upload/submit (NCBI, KBase) or import (MG-RAST, SRA, EBI); queries span public servers, a “walled garden” server, and private servers.]
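Not in the slides: a very rough sketch of what the web interface + API layer might look like, assuming a plain HTTP/JSON query endpoint in front of the graph query layer. The endpoint, payload, and run_graph_query helper are all hypothetical.

# Hypothetical sketch of the "web interface + API" component; nothing here is
# a committed design. Requires Flask.
from flask import Flask, request, jsonify

app = Flask(__name__)

def run_graph_query(query, datasets):
    # Placeholder for the graph query layer, which would dispatch the query
    # across public, walled-garden, and private data servers.
    return {"query": query, "datasets": datasets, "matches": []}

@app.route("/query", methods=["POST"])
def query_endpoint():
    body = request.get_json()
    return jsonify(run_graph_query(body["query"], body.get("datasets", "public")))

if __name__ == "__main__":
    app.run(port=8080)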
Graph queries across public & walled-garden data sets:
See Lee, Alekseyenko, Brown, 2009, SciPy Proceedings: ‘pygr’ project.
[Example graph: nodes for a raw sequence, an assembled sequence, nitrite reductase, and ppaZ, connected by “SIMILAR TO” and “ALSO CONTAINS” relationships.]
Graph queries across public & walled-garden data sets:
“What data sets contain <this gene>?”
“Which reads match to <this gene>, but not in <conserved domain>?”
“Give me relative abundance of <gene X> across all data sets, grouped by nitrogen exposure.”
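Not part of the original slides: one way those questions might be phrased against a hypothetical Python client for the proposed server. The client class, method names, and the gene/domain examples are all invented for illustration.

# Hypothetical client sketch for the proposed graph query layer. Every name
# here (class, methods, gene/domain strings) is invented for illustration.
class GraphQueryClient:
    def __init__(self, url):
        self.url = url   # a public or walled-garden server

    def datasets_containing(self, gene):
        """'What data sets contain <this gene>?'"""
        ...

    def reads_matching(self, gene, excluding_domain=None):
        """'Which reads match to <this gene>, but not in <conserved domain>?'"""
        ...

    def relative_abundance(self, gene, group_by):
        """'Relative abundance of <gene X> across all data sets, grouped by ...'"""
        ...

client = GraphQueryClient("https://example.org/api")
client.datasets_containing("nirS")
client.reads_matching("nirS", excluding_domain="heme-binding")
client.relative_abundance("nirS", group_by="nitrogen_exposure")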
Thesis:
If we can provide immediate returns for data sharing, researchers will do so, and do so immediately.
Not to do so would place them at a competitive disadvantage.
(All the rest is gravy: open analysis system, reproducibility, standardized data format, etc.)
Puzzle pieces.
1. Inexpensive and widely available cloud computing
infrastructure?
Yep. See Amazon, Google, Rackspace, etc.
Puzzle pieces.
2. The ability to do many or most sequence analyses
inexpensively in the cloud?
Yep. This is one reason for khmer & khmer-protocols.
Puzzle pieces.
3. Locations to persist indexed data sets for use in
search & retrieval?
figshare & dryad (?)
Puzzle pieces.
4. Distributed data mining approaches?
Some literature, but I know little about it.
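Not in the slides (the editor’s notes mention starting with cross-validation): a rough sketch of distributed evaluation in which each participating server fits and scores models on its own data and only summary scores cross the wire. All names here are hypothetical.

# Rough sketch of distributed model evaluation: each data server runs k-fold
# cross-validation locally and returns only per-fold scores, so raw data never
# leaves the public, walled-garden, or private server. The server objects and
# their cross_validate() method are hypothetical.
def evaluate_on_server(server, model_params, n_folds=5):
    return server.cross_validate(model_params, folds=n_folds)   # list of scores

def federated_cv(servers, model_params):
    scores = []
    for server in servers:
        scores.extend(evaluate_on_server(server, model_params))
    return sum(scores) / len(scores)   # aggregate summaries only

# Usage (hypothetical):
# servers = [public_server, walled_garden_server, private_server]
# print(federated_cv(servers, {"k": 20}))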
In summary:
How will we do this?
I PLAN TO FAIL.
A LOT.
PUBLICLY.
(ht @ethanwhite)
In summary:
How will we know if (or when) we’ve “won”?
1. When people use, extend, and remix our software and concepts without talking to us about it first. (cf. khmer!)
2. When the system becomes so useful that people go back and upload old data sets to it.
In summary:
The larger vision
Enable and incentivize sharing by providing immediate utility; frictionless sharing.
Permissionless innovation for e.g. new data mining approaches.
Plan for poverty with federated infrastructure built on open & cloud.
Solve people’s current problems, while remaining agile for the future.
Thanks!
References and pointers welcome!
https://github.com/ged-lab/buoy
(Note: there’s nothing there yet.)


Editor's Notes

  1. Analyze data in cloud; import and export important; connect to other databases.
  2. Set up infrastructure for distributed query; base on graph database concept of standing relationships between data sets.
  3. Work with other Moore DDD folk on the data mining aspect. Start with cross validation, move to more sophisticated in-server implementations.