I'm pleased to announce the publication of "Tackling soil diversity
with the assembly of large, complex metagenomes", by Adina Howe, Janet
Jansson, Stephanie Malfatti, Susannah Tringe, James Tiedje, and
myself. The paper is openly available on the PNAS Web site here
This paper is a milestone for me personally, because it's one of the
two projects that drove my first five years of research. This is the
project that led us to develop scalable k-mer counting with the khmer toolkit; figure out compressible De Bruijn
graphs and partitioning;
and ultimately led to digital normalization.
The story starts way back in ~2009, when Adina Howe arrived in my lab.
She had applied for an NSF postdoctoral fellowship in bioinformatics
to work between Jim Tiedje and me. (Adina
endearingly refers to me as her "less famous advisor.") Over the
first few weeks of her time in our labs, we decided that we would
fling Adina at the Great Prairie Grand Challenge,
an effort to apply the newly available Illumina sequencing technology
to the characterization of soil. I had some previous experience with
metagenomics from grad school, and somewhat
to my surprise found myself the local expert on shotgun metagenomics,
so I thought, what the heck?
We then proceeded to fail repeatedly for two years.
The fundamental challenge of this project lay at the intersection of
two problems: first, the large volume of data produced by Illumina
made most obvious approaches fail; and second, soil is the most
diverse and rich environment on the planet, exceeding the complexity
of the human genome by 100-1000 fold, meaning most obvious approaches
What did we try?
First, we tried BLASTing the reads against nr. This was frustrating
on several levels. We had about 500m 2x76 bp reads from the Illumina
GAII machine, and it took a really long time to search these against a
large database like nr. Most reads didn't match, and, when they did,
they were very short matches that we feared were non-specific. After
a pain-filled few months of BLASTing, we decided we needed to get longer
contigs out of these. Cue the decision to attempt assembly!
Now, we faced the problem of actually completing an assembly. With
the diversity and volume of our data, most computers just ...broke. We
spent a lot of time crashing our lab computers, to no avail.
Since the problem was the volume of data, we decided to try
eliminating low-coverage reads by trimming them at low-abundance
k-mers. The logic was that we only expected high-abundance things to
assemble anyway, so if we stopped feeding low-abundance data into the
assembler maybe we could scale the whole process up. This approach had
been applied to several complex microbial communities, including
permafrost soil and
the rumen microbiome. Tallymer (the only
k-mer counter available at the time) also crashed on the volume of
data, and didn't seem that useful for trimming, either. This led us
to adapt khmer to low-memory k-mer
counting and read trimming -- you can see one of my first scientific
blog posts on that, here.
Once we trimmed the data, we fed it into an assembler and ... got
nothing. Turns out when you have shallow sequencing of an incredibly
diverse community, everything is low coverage; eliminating
everything with low coverage means you eliminate all your data!
(You may be noticing the "failure" theme here. Yes, it was a slightly
Then came the Terabase Metagenomics meeting, a gathering of
computational folks and microbiologists interested in environmental
microbiology. This meeting convinced me that absolutely no one had a
solution -- all of the people working on large metagenomes were stuck just
like us. This was an important step mentally, because it
convinced me that this project was worth some real effort. Another
important aspect of this meeting was that the one of the groups that
claimed they had a partial solution was unwilling to share their
code with us (despite us being, in theory, collaborators). So I
mentally said, "screw 'em, I guess we'll have to go this alone."
Discussions at this meeting and follow-on conversations in my lab then
led to the development of the compressible De Bruijn graph
implementation featured here.
The conversations went something like this: "We should be able to
subdivide the data by walking from k-mer to k-mer." "How can we do
that?" "I don't know." "How about..." <repeated banging of head
against table followed by trial implementation> "That doesn't scale,
does it?" "No, shh, the k-mers will hear you." <repeat from beginning>
We eventually solved the problem (see our paper).
Entertainingly, the paper that came out of that work happened because I
gave the task of determining the limits of our k-mer graph
storage to Jason Pell. I specifically remember waving my hands at him, telling
him to figure it out, and
saying "You might look at percolation theory. I bet there's a
critical point in there somewhere." He then went off and figured it
out, taking about 6 months to become well versed in percolation theory
and associated simulations. True story. Mentorship at its best!
Once that worked we could assemble the Illumina GAII data by breaking it up into
disconnected partitions, or graph components. Huzzah! We got
about a 20x scaling factor out of this, so we could now take data that
would ordinarily require a maximum of 300 GB of RAM and work with it
in under 60 GB of RAM. Of course, in the meantime, Illumina had come out
I vividly remember the phone call with our JGI collaborators where we
announced our partitioning approach. I said, "Hey, so we can assemble
all 500m reads!" And they said, "Great! We just brought our HiSeqs on
line -- here's another 2 billion reads. Does your solution scale to that?"
The answer was no, our solution did not scale to that. The reason
we'd needed to assemble in ~60 GB of RAM was that Amazon m4.xlarge
machines had 68 GB of RAM; we didn't have access to 300 GB machines.
I became rather depressed at this point. Because of Illumina's damned
data generation capacity, we'd spent about two years coming up with
solutions that made us fall slightly slower behind than everyone else.
This didn't seem like a good way to do bioinformatics research.
The fundamental problem, it seemed, was that even when we could break
things up into partitions, the partitions would be getting bigger and
bigger as we sequenced more deeply; this led to more data to partition,
which was the blocking step.
Banging our head against this occupied our next few months. The key
insight we eventually reached was that even though the partitions
were growing larger, at some point the information content was
saturated. We just needed to stop collecting data at saturation!
This led to digital normalization, a sublinear algorithm for
discarding redundant data, which turns out to be pretty widely useful
in assembly and beyond.
The story doesn't quite end there, but you can read the rest of it in
the paper we just published. Our lab could now assemble large,
complex metagenomes fairly easily (although we still have a ways to go
towards the volume of data we think we need).
Adina spent a solid year or more analyzing assemblies and benchmarking
our approaches, and this paper is the result.
Challenges in publication
We submitted the paper to a Glamour Mag, and got rejected without
review. I inquired as to why, and was told, essentially, that "we
asked several reviewers informally and they said there wasn't anything
new in the methods." This was because we'd published the partitioning
approach already, and had submitted the diginorm paper separately as
well. I personally thought it would be a plus that we'd evaluated the
methods thoroughly before applying them to an unknown data set, but
apparently others felt differently.
Upon submission to PNAS, one reviewer likewise commented that they
were unsure of the distinction between our various papers on the topic
of assembly and this new paper. We tried to clarify this in the
paper, and ultimately the reviewer accepted our argument that we had
neither evaluated diginorm on metagenomes, nor shown that partitioning
led to species bins, in previous papers -- note this is separate from
the argument that, hey, we can do bigger assemblies than ever before.
Based on these experiences, I think this paper fell directly into the
"novelty squared" problem identified by Josh Bloom: it's not enough to
apply validated methods to cool new data, the methods themselves
have to be awesome and novel as well. When combined with the space
crunch of glam mags, this inevitably leads to bad science -- either the methods
(generally) or the data implications (rarely) are shorted in the final
If you've made it this far, here are my personal take home messages from
First, our combined partitioning+diginorm approach results in good
sensitivity and specificity when applied to known communities. This
mirrors what we've heard from many others who are now using the
Second, binning metagenome reads by De Bruijn graph partitions results
in species- or strain-level bins of reads. That was our intuition at the
time, and it's nice to see it born out.
Third, assembling dramatically improves the detection signal for
similarity searches (see figure S5). Since this was our original
rationale for assembly it's nice to see it born out. My pithy take
home message here is, Friends don't let friends annotate single
Fourth, we need higher coverage. We only assembled 20% of the data,
and it seems likely (based on our coverage plots; see Fig 2) that this
is due to low coverage. Give us 10x as much data and we'll give you
3x as much assembly :).
Related to this, I commonly get the question: "but you're ignoring 80%
of your data by only annotating your contigs!" Yep. I'd rather have
20% accurately annotated than 100% badly annotated, but YMMV -
hope you have a good rationale for how the increased false positive
rate is not a problem.
Take home career messages
Looking back, I'm pretty sure I wouldn't have taken on this project
if I'd understood the problem better. But it's been one of the projects
that's defined my research program. Hmm, food for thought.
Despite our progress, I've repeatedly failed to get funding for
environmental metagenome assembly. Either I'm really bad at
explaining the problem, or it's not viewed as an important problem, or
we're tackling problems that not many people have yet. How do you
Good methods generalize. We've done our best to minimize parameters and
heuristics in our approaches, and, in reward, our approaches apply broadly.
We don't yet entirely understand the nature of the solutions -- partitioning
is annoyingly sensitive to coverage -- but we're working on it.
Tackling hard problems sometimes pays off! Most of my research program
came out of our various attempts to solve this problem.
There are comments.