Published: Sat 10 May 2014
By C. Titus Brown
In science.
tags: research scifi moore
My second-round Data Driven Discovery
application is due on Monday, and my first draft contained the
following story. I don't think I'll include it in the actual
application, but it was entertaining enough to write that I thought
I'd post it here.
A vision of the future I would like to enable
In 2020, a second-year graduate student named Lucy will take her newly
generated city metagenome / transcriptome / proteome data sets and run her
first round of analyses. These analyses will compare her data to
known genes and genomes, and bring those functional and genomic
annotations into her workspace for further manipulation. The system
will also compare her data against all of the public metagenomes,
alerting Lucy to similarities with a number of other city and environmental
data sets.
Metadata will also be scanned for correlations between various
environmental parameters and features of the data set. Metabolic
models will automatically be run, and unusual features of the data set
(unusual pathways; missing pathway components; unusual signatures
of positive selection) will be highlighted for later inspection. On top of
this basic set of analyses, the system will provide a set of
interactive exploration and visualization tools that Lucy can use to
adjust parameters, dig into details, and link to additional services
that may be of use. These first analyses will be the first half of
Lucy's data investigation, leaving her, her advisor, and her
collaborators to ask the intelligent questions and follow up on
specific analyses.
Lucy may neither know nor particularly care that these analyses are
being run using a mixture of local campus computing, cloud computing
paid for by her grant funding, and government-funded standing database
services. It will, at first blush, be unimportant that each of these
analyses is run using a well documented and completely open set of
algorithms, workflows, data sets, and software. She will be able to
share her results with her collaborators and publish them to an
archival location with a single click, but, again, the details will
not matter to her; she will simply have a DOI to place in her paper.
Because the entire workflow will be open, versioned, and published,
reproducibility will only be a challenge when she uses a custom
algorithm developed by her collaborators; however, because her work is
all done in a literate research environment like IPython Notebook, the
commands and parameters executed will be logged as a matter of course.
Lucy's data will also be posted to public databases as soon as basic
quality control is done, because her funding organization requires
this. While the data will be marked as part of an unpublished
investigation, the data itself will be citable. Lucy and her advisor
will not mind, because she is using many other people's data under
similar conditions (and she also understands that data itself is not
particularly interesting, since her data generation cost well
under 5% of the total budget).
Lucy will also have the option of setting up a wide variety of alerts,
enabling periodic searches to be done against new data sets for
genomic overlap, similar annotation patterns, or new biological
annotations. She will be able to repopulate her analyses
with these new results automatically, as will anyone working with her
analyses or (eventually) looking at her published study. As new
algorithms and databases come online, computational researchers will
be able to alert Lucy to potentially new and impactful results that
might deepen her research conclusions.
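To make the alert idea a bit more concrete, here is a minimal Python sketch of what one such periodic check might look like. It is only an illustration under heavy assumptions: "genomic overlap" is approximated by Jaccard similarity between sets of short k-mers, the data sets are plain strings held in memory, and the names `kmers`, `jaccard`, and `find_alerts` are hypothetical rather than part of any existing system. A real service would need sketching or indexing techniques to work at metagenome scale.

```python
from typing import Dict, List, Set, Tuple


def kmers(sequence: str, k: int = 5) -> Set[str]:
    """Return the set of all length-k substrings of a sequence."""
    seq = sequence.upper()
    # A sequence shorter than k yields an empty set.
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity: shared k-mers divided by all distinct k-mers."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def find_alerts(
    own_sequence: str,
    new_datasets: Dict[str, str],
    threshold: float = 0.3,
    k: int = 5,
) -> List[Tuple[str, float]]:
    """Return (name, score) for each new data set whose k-mer overlap
    with own_sequence meets the threshold, highest score first.

    The threshold and k defaults are arbitrary illustration values.
    """
    own = kmers(own_sequence, k)
    hits = []
    for name, sequence in new_datasets.items():
        score = jaccard(own, kmers(sequence, k))
        if score >= threshold:
            hits.append((name, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


if __name__ == "__main__":
    # Toy data, invented purely for demonstration.
    lucy = "ACGTACGTGGA"
    incoming = {"same": "ACGTACGTGGA", "unrelated": "TTTTTTTTTTT"}
    for name, score in find_alerts(lucy, incoming, threshold=0.5, k=4):
        print(f"alert: {name} overlaps with score {score:.2f}")
```

Run on a schedule against newly deposited data, a check like this would be the trigger for repopulating Lucy's analyses; the interesting design questions (what counts as "similar", and how often to look) are exactly the parameters she would want to adjust.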
--titus