I'm giving a talk at XLDB 2012 tomorrow, and I thought I'd post a bunch of accompanying links and discussion, since this audience is pretty far away from my normal audience ;).
Here's the talk itself, on slideshare: Streaming and Compression Approaches for Terascale Biological Sequence Data Analysis
Acknowledgements (slide 3)
The work I'm talking about today was done by a collection of people in the lab that includes Adina Howe, a postdoc; Jason Pell, a graduate student; Arend Hintze, a former postdoc; Rosangela Canino-Koning, a former graduate student; Qingpeng Zhang, a graduate student; and Tim Brom, a former graduate student. Much of the metagenomics work was driven by Adina, and the compressible graph representation work was primarily done by Jason. Qingpeng and Tim work(ed) on k-mer counting and digital normalization, among many other things.
The data I'm talking about was generated as part of the Great Prairie Grand Challenge project; the collaborators responsible for the data generation are Jim Tiedje, Janet Jansson, and Susannah Tringe. The sequencing was done by JGI. My funding comes from the USDA for sequence analysis tools, a small NSF grant for metagenomics work, and the BEACON NSF Center for the study of evolution in action.
Open Science (slide 5)
Our code is at github.com/ged-lab/.
You're at my blog :).
I've posted a number of grants and pubs on the lab Web site at http://ged.msu.edu/interests.html
Our PNAS paper on probabilistic de Bruijn graphs is Pell et al, 2012.
Our digital normalization paper is available through arXiv.
You can see a longer, more technical version of this talk from a presentation at U. Arizona over at slideshare, and when I gave the same basic talk at MSU last week, they recorded it and I put it on YouTube.
Soil is full of uncultured microbes (slides 6 and 7)
This is observed by many people; Gans et al. (Science, 2005) used DNA reassociation kinetics to estimate that soil contained upwards of a million distinct genomes. Fierer and Jackson (PNAS, 2006) talk about the biogeography of soil -- what environments, where, are how diverse? -- and there are several other papers worth reading, including Dunbar et al. (AEM, 2002) and Tringe et al. (Science, 2005).
The "collector's curve" on slide 6 shows the increase in the number of "Operational Taxonomic Units", or OTUs -- a species analog, but for uncultured critters -- as you sequence the 16s marker gene more and more deeply from mixed soil communities. If this graph showed saturation it would imply that diversity was capped at that point on the y axis. (It's not.)
Shotgun metagenomics and assembly (slides 8-13)
Shotgun sequencing and assembly of uncultured microbes is one of the things I work on. Miller et al. (Genomics, 2010) is one of the best papers I've seen for explaining assembly with de Bruijn graphs.
Conway and Bromage (Bioinformatics, 2011) is the reference I use to point out that errors in sequencing cause de Bruijn graph size to increase linearly with sequencing depth.
Sequencing is becoming Big Data (slide 14)
The infamous sequencing costs slide shows that sequencing capacity has been growing faster than Moore's Law for the last few years. The point I like to make in connection with this is that this data generation capacity now extends to many very small labs -- these sequencers are neither that expensive nor that tricky to operate -- so many small labs are now generating vast quantities of data.
I wrote a blog post pointing out that when your data generation capacity is scaling faster than Moore's Law, cloud computing is not the long-term solution -- you need better algorithms.
Soil sequencing is particularly obnoxious (slide 15)
It's a straightforward calculation to go from robustly observing species at a 1-in-a-million dilution to pointing out that shotgun sequencing of those 1-in-a-million critters will require about 50 terabases of sequencing:
If you assume each microbial genome is 5 mb, and you need to randomly sample each genome to (at least) 10x for assembly, then you need 50 mb x 1 million to robustly sample your rarest genome.
With current assembly approaches that would require about 500 TB of RAM on a single chassis, partly because of errors, and partly because of the true underlying diversity of the sample.
Probabilistic de Bruijn graphs (slides 16-22)
This goes over the approach we just published in Pell et al, (PNAS, 2012) for partitioning metagenomes using low-memory data structures.
See my story behind the paper blog post, too.
Online, streaming, lossy compression (slides 23-27)
This goes over the digital normalization approach that's in preprint form at arXiv.
What do we assemble? (slide 28)
We get a lot of assembled microbial sequences out of our soil samples. We have never seen most of 'em.
Physics ain't biology (slides 30-39)
This is discussed in more detail in this blog post. I need to write a follow up...
Three anecdotes that I want to give during the talk:
First, one of the (several) reasons I'm in biology now is because of Hans Bethe, the well known physicist. We were talking over dinner (my father and Hans collaborated for many years) and I asked him what career advice he had for young scientists. He responded that were he starting in science today, he would go into biology, as it had the most opportunities going forward.
Second, the Endomesoderm Network diagram (http://sugp.caltech.edu/endomes/) started out as an immensely useful map of everything that was known; for the first decade or so, it focused on laying out the necessary genes. Until it was reasonably complete, there was no point in trying to model it or use it to discuss sufficiency. I think the Davidson Lab has been moving in this direction, although I'm not sure -- I left in 2006. My main points with the diagram are that it's immensely complicated in terms of interactions; it took a decade or more to put together; it's for only one organism; and there's limited ability to use it for modeling until you know the majority of the interactions.
Third, my experience in collaborating with physicists is that even if I'm not as smart as my collaborators, I often know way more biology at both an explicit and an intuitive level, and this is a useful contribution. Physicists go right for the throat of problems and become monofocused, and sometimes it's nice to remind them that we don't know everything, but what we do know suggests that maybe something else is going on too, etc. etc.