My second-round Data Driven Discovery application is due on Monday, and my first draft contained the following story. I don't think I'll include it in the actual application, but it was entertaining enough to write that I thought I'd post it here.
A vision of the future I would like to enable
In 2020, a second-year graduate student named Lucy will take her newly generated city metagenome / transcriptome / proteome data sets and run her first round of analyses. These analyses will compare her data to known genes and genomes, and bring those functional and genomic annotations into her workspace for further manipulation. The system will also compare her data against all of the public metagenomes, alerting Lucy to similarities with a number of other city and environmental data sets. Metadata will also be scanned for correlations between various environmental parameters and features of the data set. Metabolic models will automatically be run and unusual features of the data set (unusual pathways; missing pathway components; unusual signatures of positive selection) will be highlighted for later inspection. On top of this basic set of analyses, the system will provide a set of interactive exploration and visualization tools that Lucy can use to adjust parameters, dig into details, and link to additional services that may be of use. These first analyses will be the first half of Lucy's data investigation, leaving her, her advisor, and her collaborators to ask the intelligent questions and follow up on specific analyses.
Lucy may neither know nor particularly care that these analyses are being run using a mixture of local campus computing, cloud computing paid for by her grant funding, and government-funded standing database services. It will, at first blush, be unimportant that each of these analyses is run using a well-documented and completely open set of algorithms, workflows, data sets, and software. She will be able to share her results with her collaborators and publish them to an archival location with a single click, but, again, the details will not matter to her; she will simply have a DOI to place in her paper. Because the entire workflow will be open, versioned, and published, reproducibility will only be a challenge when she uses a custom algorithm developed by her collaborators; even then, because her work is all done in a literate research environment like IPython Notebook, the commands and parameters executed will be logged as a matter of course.
Lucy's data will also be posted to public databases as soon as basic quality control is done, because her funding organization requires this. While the data will be marked as part of an unpublished investigation, the data itself will be citable. Lucy and her advisor will not mind, because she is using many other people's data under similar conditions (and she also understands that data itself is not particularly interesting, since her data generation cost well under 5% of the total budget).
Lucy will also have the option of setting up a wide variety of alerts, enabling periodic searches to be done against new data sets for genomic overlap, similar annotation patterns, or new biological annotations. She will be able to repopulate her analyses with these new results automatically, as will anyone working with her analyses or (eventually) looking at her published study. As new algorithms and databases come online, computational researchers will be able to alert Lucy to new and potentially impactful results that might deepen her research conclusions.
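To make the "genomic overlap" alert a little more concrete, here is a minimal sketch of what one such periodic check might look like. Everything in it is an assumption made for illustration: the function names (kmers, jaccard, find_overlaps), the choice of k, and the threshold are all hypothetical, and a real system would almost certainly use a compressed sketching technique such as MinHash rather than exact k-mer sets.

```python
def kmers(seq, k=4):
    """Return the set of all length-k substrings of a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def find_overlaps(query_seqs, new_datasets, k=4, threshold=0.5):
    """Compare a query data set against newly arrived data sets.

    query_seqs   -- list of sequences from Lucy's data (hypothetical input)
    new_datasets -- dict mapping a data set name to its list of sequences
    Returns (name, similarity) pairs at or above the threshold, best first.
    """
    query = set().union(*(kmers(s, k) for s in query_seqs))
    hits = []
    for name, seqs in new_datasets.items():
        other = set().union(*(kmers(s, k) for s in seqs))
        score = jaccard(query, other)
        if score >= threshold:
            hits.append((name, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


if __name__ == "__main__":
    # Toy sequences, invented purely for this example.
    lucy = ["ACGTACGT"]
    incoming = {
        "subway_swab": ["ACGTACGT"],
        "soil_core": ["TTTTTTTT"],
    }
    for name, score in find_overlaps(lucy, incoming):
        print(f"alert: {name} overlaps your data (Jaccard {score:.2f})")
```

Run on a schedule against each new public data set, a check like this would be enough to trigger the kind of alert described above; the interesting work lies in making it scale and in deciding which overlaps are worth Lucy's attention.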