Here are my talk notes for the Data Driven Discovery
grant competition ("cage match" round). Talk slides are on
You can see my full proposal here as
Hello, my name is Titus Brown, and I'm at Michigan State University
where I run a biology group whose motto is "better science through
superior software". I'm going to tell you about my vision for building
infrastructure to support data-intensive biology.
Our research is focused on accelerating sequence-based biology with
algorithmic prefilters that help scale downstream sequence analysis.
The basic idea is to take the large amounts of data coming from
sequencers and squeeze out the information in the data for downstream
software to use. In most cases we make things 10 or 100x easier; in
some cases we've been able to make analyses possible where they
weren't doable before.
In pursuit of this goal, we've built three super awesome computer
science advances: we built a low-memory approach for counting sequence
elements, we created a new data structure for low-memory graph storage,
and we developed a streaming lossy compression algorithm that puts much
of sequence analysis on an online and streaming basis. Collectively,
these are applicable to a wide range of basic sequence analysis problems,
including error removal, species sorting, and genome assembly.
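As a rough illustration of how low-memory counting and streaming lossy compression can fit together, here is a minimal sketch. It is not khmer's implementation: it assumes a Count-Min Sketch as the counter and a median-abundance rule for deciding which reads to keep, and every name and parameter in it (`CountMinSketch`, `streaming_filter`, `width`, `depth`, `cutoff`) is made up for this example.

```python
import hashlib
from statistics import median


class CountMinSketch:
    """Fixed-memory approximate counter: it can overcount, never undercount."""

    def __init__(self, width=10007, depth=4):
        self.width = width
        self.depth = depth
        self.tables = [[0] * width for _ in range(depth)]

    def _slots(self, item):
        # One hash per table, derived by salting the same hash function.
        for i in range(self.depth):
            digest = hashlib.blake2b(item.encode(), salt=bytes([i])).digest()
            yield i, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item):
        for i, slot in self._slots(item):
            self.tables[i][slot] += 1

    def count(self, item):
        # The smallest counter is the one least inflated by collisions.
        return min(self.tables[i][slot] for i, slot in self._slots(item))


def kmers(read, k):
    """All overlapping substrings of length k in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def streaming_filter(reads, k=4, cutoff=3, sketch=None):
    """Single pass over reads: keep a read only while its k-mers are still rare.

    Assumed rule (hypothetical, for illustration): a read is kept if the
    median count of its k-mers seen so far is below `cutoff`; only kept
    reads update the counts. Discarded reads are lost, hence "lossy".
    """
    if sketch is None:
        sketch = CountMinSketch()
    for read in reads:
        words = kmers(read, k)
        if not words:
            continue  # read shorter than k
        if median(sketch.count(w) for w in words) < cutoff:
            for w in words:
                sketch.add(w)
            yield read
```

Feeding ten copies of one read followed by a novel read through `streaming_filter` keeps only the first couple of copies plus the novel read, and memory use is set by `width * depth` rather than by the number of distinct k-mers.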
We've implemented all three of these approaches in an open source
analysis package, khmer, that we developed and maintain using good
software engineering approaches. We have primarily focused on using
this to drive our own research, but since you can do analyses with it
that can't be done any other way, we've had some pretty good adoption by
others. It's a bit hard to tell how many people are using it because
of the many ways people can download it, but we believe it to be in the
1000s and we know we have dozens of citations.
The most important part of our research is this: we have enabled some
excellent biology! For a few examples, we've assembled the largest
soil metagenome ever with Jim Tiedje and Janet Jansson, we've helped
look at deep sea samples of bone eating worms with Shana Goffredi,
we're about to publish the largest de novo mRNASeq analysis ever done,
and we're enabling evo devo research at the phylogenetic extremes of the
Molgula sea squirts. This was really what we set out to do at the beginning
but the volume and velocity of data coming from sequencers turned out to be
the blocking problem.
Coming from a bit of a physics background, when I started working in
bioinformatics 6 years ago, I was surprised at our inability to
replicate others' results. One of our explicit strategies now is to
try to level up the field by doing high quality, novel, open science.
For example, our lamprey analysis is now entirely automated, taking
three weeks to go from 3 billion lamprey mRNASeq reads to an
assembled, annotated transcriptome that we can interactively analyze
in an IPython Notebook, which we will publish with the paper.
Camille, who is working on this, is a combination software engineer
and scientist, and this has turned out to be a really productive
We've also found that 1000s of people want to do the kinds of things
we're doing, but most don't have the expertise or access to
computational infrastructure. So, we're also working on open
protocols for analyzing sequence data in the cloud - going from raw
mRNASeq data to finished analysis for about $100. These protocols are
open, versioned, forkable, and highly replicable, and we've got about
20 different groups using them right now.
So that's what I work on now. But looking forward I see a really big
problem looming over molecular biology. Soon we will have whatever
'omic data set we want, to whatever resolution we want, limited
only by sampling. But we basically don't have any good way of analyzing
these data sets -- most groups don't have the capacity or capability to
analyze them themselves, we can't store these data sets in one place,
and -- perhaps the biggest part of the catastrophe -- people aren't
making these data available until after publication, which means that
I expect many of them to vanish. We need to incentivise
pre-publication sharing by making it useful to share your data. We can
do individual analyses now, but we're missing the part that links these
analyses to other data sets more broadly.
My proposal, therefore, is to build a distributed graph database system that
will allow people to interconnect with open, walled-garden, and private
data sets. The idea is that researchers will spin up their own server in the
cloud, upload their raw or analyzed data, and have a query interface that
lets them explore the data. They'll also have access to other public servers,
and be able to opt-in to exploring pre-published data; this opt-in will be
in the form of a walled-garden approach where researchers who use results
from analyzing other unpublished data sets will be given citation information
to those data sets. I hope and expect that this will start to incentivise
people to open their data sets up a bit, but to be honest if all it does
is make it so that people can analyze their own data in isolation it will
already be a major win.
None of this is really a new idea. We published a paper exploring some of
these ideas in 2009, but have been unable to find funding. In fact, I
think this is the most traditionally unfundable project I have ever proposed,
so I hope Moore feels properly honored. In any case, the main point here
is that graph queries are a wonderful abstraction that lets you set up
an architecture for flexibly querying annotations and data when certain
precomputed results already exist. The pygr project showed me the power
of this when implemented in a distributed way and it's still the best
approach I've ever seen implemented in bioinformatics.
The idea would be to enable basic queries like this across multiple
servers, so that we can begin to support the queries necessary for automated
data mining and cross-validation.
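As a rough sketch of what a cross-server graph query with walled-garden citation tracking might look like: this is not pygr's API and not a design for the proposed system. It assumes each server is just an in-memory labeled adjacency list, and the class, function, tier, and relation names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class GraphServer:
    """Stand-in for one group's server (hypothetical, in-memory only)."""
    name: str
    tier: str        # "open", "walled-garden", or "private"
    owner: str
    citation: str
    edges: dict = field(default_factory=dict)  # node -> {(relation, node), ...}

    def add_edge(self, src, relation, dst):
        self.edges.setdefault(src, set()).add((relation, dst))

    def neighbors(self, node):
        return self.edges.get(node, set())


def visible(server, user, opted_in):
    """Assumed access rule: open to all, walled garden by opt-in, private to owner."""
    if server.tier == "open":
        return True
    if server.tier == "walled-garden":
        return opted_in or server.owner == user
    return server.owner == user


def federated_neighbors(servers, node, user, opted_in=False):
    """One-hop query across every visible server.

    Returns the merged edges plus the citations owed for any results that
    came from someone else's walled-garden (unpublished) data set.
    """
    results, to_cite = set(), set()
    for server in servers:
        if not visible(server, user, opted_in):
            continue
        hits = server.neighbors(node)
        results |= hits
        if hits and server.tier == "walled-garden" and server.owner != user:
            to_cite.add(server.citation)
    return results, to_cite
```

A caller who has not opted in sees only open data and their own private server; opting in widens the results and returns the citation strings for the unpublished data sets that contributed.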
My larger vision is very buzzwordy. I want to enable frictionless
sharing, driven by immediate utility. I want to enable permissionless
innovation, so that data mining folk can try out new approaches
without first finding a collaborator with an interesting data set, or
doing a lot of prep work. By building open, federated infrastructure,
and avoiding centralized infrastructure, I am planning for poverty:
everything we build will be sustainable and maintainable, so when my
funding goes away others can pick it up. And my focus will be on
solving people's current problems, which in biology are immense,
while remaining agile in terms of what problems I tackle next.
The thing is, everybody needs this. I work across many funding agencies,
and many fields, and there is nothing like this currently in existence.
I'm even more sure of this because I posted my Moore proposal and requested
feedback and discussed it with a number of people on Twitter. NGS has
enabled research on non-model organisms but its promise is undermet due
to lack of cyberinfrastructure, basically.
How would I start? I would hire two domain postdocs who are tackling
challenging data analysis tasks, and support them with my existing
lab; this would involve cross-training the postdocs in data intensive
methodologies. For example, one pilot project is to work on the data
from the DeepDOM cruise, where they did multi-omic sampling across
about 20 points in the Atlantic, and are trying to connect the dots
between microbial activity and dissolved organic matter, with
metagenomic and metabolomic data.
Integrated with my research, I would continue and expand my current
efforts in training. I already run a number of workshops and generate
quite a bit of popular bioinformatics training material. One thing
that I particularly like about this approach is that it's deeply
self-interested: I can find out what problems everyone has, and will
be having soon, by working with them in workshops.