So, there's this fairly large collection of about 700 RNAseq samples,
from 300 species in 40 or so phyla. It's called the Marine Microbial
Eukaryotic Transcriptome Sequencing Project (MMETSP), and was funded
by the Moore Foundation as a truly field-wide collaboration to improve
our reference collection for genes (and more). Back When, it was
sequenced and assembled by the National Center for Genome Resources, and published in PLOS Biology (Keeling et
al., 2014).
Partly because we think assembly has improved in the last few years,
partly as an educational exercise, partly as an infrastructure
exercise, partly as a demo, and partly just because we can, Lisa Cohen
in my lab is starting to reassemble all of the data - starting with
about 10%.
She has some of the basic evaluations (mostly via transrate) posted, and before we pull the
trigger on the rest of the assemblies, we're pausing to reflect and to
think about what metrics to use, and what kinds of resources we plan
to produce. (We are not lacking in ideas, but we might be lacking in
good ideas, if you know what I mean.)
In particular, this exercise raises some interesting questions that we
hope to dig into:
- what does a good transcriptome look like, and how could having 700
assemblies help us figure that out? (hint: distributions)
- what is a good canonical set of analyses for characterizing transcriptome
assemblies?
- what products should we be making available for each assembly?
- what kind of data formatting makes it easiest for other bioinformaticians
to build off of the compute we're doing?
- how should we distribute the workflow components? (Lisa really likes shell
scripts but I've been lobbying for something more structured. 'make' doesn't
really fit the bill here, though.)
- how do we "alert" the community if and when we come up with better
assemblies? How do we merge assemblies between programs and efforts,
and properly credit everyone involved?
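To make the "distributions" hint a bit more concrete, here is a minimal sketch, not what we actually run. It assumes you already have one numeric quality metric per assembly and flags the assemblies that fall outside the bulk of the distribution (Tukey fences on the interquartile range). The sample names, the numbers, and the `flag_outliers` helper are all made up for illustration.

```python
import statistics


def quartiles(values):
    """Return (Q1, median, Q3) of one metric across all assemblies."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q1, median, q3


def flag_outliers(metric_by_sample, k=1.5):
    """Return the names of samples whose metric lies outside the Tukey fences.

    metric_by_sample maps a (hypothetical) sample name to a single number,
    e.g. an overall assembly score. k=1.5 is the conventional fence width.
    """
    q1, _, q3 = quartiles(list(metric_by_sample.values()))
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return sorted(
        name for name, value in metric_by_sample.items()
        if value < low or value > high
    )


# Invented numbers, for illustration only.
example = {
    "sample_A": 0.31,
    "sample_B": 0.29,
    "sample_C": 0.33,
    "sample_D": 0.30,
    "sample_E": 0.05,
}
print(flag_outliers(example))
```

With 700 assemblies rather than five, the same idea lets an oddball assembly stand out against the rest without anyone having to decide up front what a "good" absolute value is.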
Anyway, feedback welcome, here or on Lisa's post!
We are happy to share methods, data, analyses, results, etc. etc.
--titus
p.s. Yes, that's right. I ask new grad students to start by
assembling 700 transcriptomes. So? :)