Building better metagenomics pipelines

I spend so much of my time writing stuff down to cadge funding or bruit about ideas, and much of that never really goes anywhere. In the interests of slowing down any competitors by getting them to take my old ideas seriously, here is an interesting set of ideas that I wrote up a few months ago with one particular funding body in mind.

I would welcome comments by scientists on whether or not the social ideas, below, would actually work. Remember, this is in the context of "no do-ey, no fund-ey".

(Basically, I'm trying to hack scientific culture the way ESR talks about hacking software culture. See my more general thoughts on this, too.)


We and others have a number of solutions that need to be carefully implemented with attention to both biological correctness and scale. Some specific ideas:

1) Methods for integrating metagenomic and metatranscriptomic data, and eventually metaproteomic data, to identify and annotate genes. The goal is to enable the robust comparison of gene expression across conditions and environments.

The digital normalization approach developed by my lab allows us to combine both metagenome and metatranscriptome data from many different conditions for a maximally sensitive global assembly. Following this assembly we can then recover differentially expressed genes by looking at transcriptional levels in specific samples.
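For readers who haven't seen digital normalization, here is a minimal sketch of the core idea: stream through the reads and keep a read only if the median abundance of its k-mers, among the reads kept so far, is still below a coverage cutoff. This is an illustration only, not the khmer implementation. It assumes exact in-memory counting with a Python Counter rather than a probabilistic counting structure, it ignores reverse complements, and the function names and default values for k and cutoff are made up for the example.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Yield every overlapping k-mer in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def normalize(reads, k=20, cutoff=20):
    """Keep a read only while its median k-mer count is below cutoff.

    Hypothetical helper for illustration; counts are exact and held in
    memory, so this only works for small inputs.
    """
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = list(kmers(read, k))
        if not read_kmers:
            continue  # read is shorter than k, nothing to count
        # Looking up a missing key in a Counter returns 0 without storing it.
        if median(counts[km] for km in read_kmers) < cutoff:
            counts.update(read_kmers)  # only kept reads add to the counts
            kept.append(read)
    return kept
```

Because the filter only looks at k-mer abundance, reads from metagenome and metatranscriptome samples can be fed through the same call one after another (for example with itertools.chain), and whatever survives is what would go to the assembler.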

2) Correlation and difference analysis of large metagenomic data sets. Specifically, this would enable us to query for presence/absence/abundance across metagenomic and metatranscriptomic shotgun data sets from a vast number of samples (thousands to hundreds of thousands), and to extract gene presence/absence and expression level profiles from that data. Our lab has developed the ability to do this very sensitively at the genomic level, which would be a nice complement to protein-based techniques.

For example, we can see a ~50% genomic overlap between the raw reads from Iowa prairie and Iowa corn soil samples, indicating that a substantial portion of the underlying genomes are shared. This is a general approach that would let us compare and contrast microbial communities without passaging data through the very biased filter of assembly.

The underlying technology already exists, but scaling it up so that we can do ongoing comparisons of thousands or (eventually) millions of samples, and providing a flexible query system on top of it, is a significant challenge.
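To make the read-level comparison above a bit more concrete, here is one simple way to define "overlap" between two samples without assembling anything: the fraction of distinct k-mers in one sample's raw reads that also appear in the other's. This is a sketch under my own assumptions, not a description of how the ~50% figure was computed. The function names and the choice of k are hypothetical, and holding exact k-mer sets in memory is precisely the part that would not scale to thousands of samples.

```python
def kmer_set(reads, k=20):
    """Collect the distinct k-mers found in an iterable of read sequences."""
    found = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            found.add(read[i:i + k])
    return found


def containment(reads_a, reads_b, k=20):
    """Fraction of sample A's distinct k-mers that also occur in sample B."""
    a = kmer_set(reads_a, k)
    b = kmer_set(reads_b, k)
    if not a:
        return 0.0  # no k-mers in A, so there is nothing to compare
    return len(a & b) / len(a)
```

Note that this measure is asymmetric: containment(prairie, corn) and containment(corn, prairie) can differ if one sample is much more diverse than the other, so in practice you would probably want to report both directions.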

3) Assembly-graph-based exploration of complex data sets. It is quite likely that we are failing to assemble highly variable regions from complex metagenomes, and it should be straightforward to use partitioning to isolate, detect and analyze such regions.
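As a rough illustration of what partitioning means here, the sketch below groups reads into disconnected sets, so that each set can be pulled out and examined on its own. It is a deliberate simplification and an assumption on my part: two reads are joined only if they share an identical k-mer, whereas real assembly-graph partitioning follows connectivity through the graph. The function name and default k are hypothetical.

```python
def partition_reads(reads, k=20):
    """Group reads into partitions of reads linked by shared k-mers."""
    parent = list(range(len(reads)))  # union-find forest over read indices

    def find(i):
        # Walk up to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # k-mer -> index of the first read that contained it
    for idx, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            km = read[i:i + k]
            if km in first_seen:
                root_a, root_b = find(idx), find(first_seen[km])
                if root_a != root_b:
                    parent[root_a] = root_b  # merge the two partitions
            else:
                first_seen[km] = idx

    groups = {}
    for idx, read in enumerate(reads):
        groups.setdefault(find(idx), []).append(read)
    return list(groups.values())
```

Each returned list is one partition. Comparing partitions that assemble cleanly against those that do not would be one way to start looking for the highly variable regions mentioned above.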

4) Annotation evaluation. Virtually everyone expresses frustration with the current genome annotation pipelines. I propose to develop methods for evaluating annotations for environmental (meta)genomes so that annotation pipelines and assembly strategies can be compared more objectively.

Social aspects:

The human aspect is as important as functioning technology, I think, and to address failings with existing metagenome pipelines, I would suggest the following "cross-cutting" efforts:

1) Develop all software in tight collaboration with labs who already have data and are attempting to answer specific biological problems; this ensures that the software is relevant. (Details of current collaborations omitted.)

2) Implement everything as open source software, with installation instructions for the cloud, and with a simple Galaxy Web interface on top, with cloud execution capability. (This is the model we're using for our current khmer software.)

This addresses a number of problems with the way current software pipelines in metagenomics are written. First, we can provide a standalone, tested, published, and compartmentalized workflow component that can be reused by technically savvy individuals as well as big workflow engines (e.g. MG-RAST and KBase). This answers the concern that much pipeline software is ad hoc and untested in isolation. Second, we provide a simple Web interface that lets less technically capable people use the software with predefined parameters. This enables scientists to work with the software in standalone mode. And third, the component can be executed locally (on e.g. HPC systems), via a public or private pipeline (e.g. MG-RAST), or directly on rental compute (the Amazon cloud), which provides the broadest possible set of options for cyberinfrastructure use.

3) Provide training through targeted workshops on environmental metagenomics. I run an in-depth workshop on general NGS data analysis during the summer, and feedback from that course has been very positive: students report that they are much more capable after the fact. (See for a writeup). Last year I participated in the STAMPS course at MBL, which had a combined 16S+shotgun focus.

I would propose to run a more targeted ~1 week workshop on environmental shotgun metagenomics using Illumina and other "deep sampling" strategies, organized along the lines of the NGS summer course.

Comments welcome!