Genomics leads in random directions, but ... how on earth did I end up
working on this??
The story begins
The story of my role in the Haemonchus paper starts about 3 years ago.
My friend & colleague Erich Schwarz is a guy who specializes in
nematode genomics, and he was sequencing, assembling, and analyzing
the Haemonch genome with a team down in Melbourne (largely Robin
Gasser's group, I think) as well as Paul Sternberg's lab at Caltech.
Well, I say he was assembling it, but he wasn't being very
successful at it. You could have said "throwing assemblers at it
desperately" instead. Every time I talked with him he regaled me with
tales of woe, some of which have made it into my slides -- to quote
one, "The power of next-gen sequencing: get 180x coverage... and then
watch your assemblies never finish." The basic
problem was that every assembler he tried would slurp in the reads,
churn churn churn, and then at some point die. Or not die. But one
thing no assembler would do was output anything even remotely
resembling a decent assembly. Basic quality filtering helped a bit --
Erich could get the sequencing data from short-insert libraries to
assemble, but as soon as he put in the bigger insert data, kablooey.
It didn't seem to have anything to do with compute power; rather,
something was fundamentally weird about the sequence.
While he was busy throwing himself at large volumes of sequencing
data, I was working with Jason Pell and Adina Howe on metagenomics --
that is, we were throwing ourselves at even larger volumes of
sequencing data. Most vexing. Eventually we figured out a technique
(partitioning, which splits the reads into disconnected components of the
assembly graph so that each component can be assembled separately) that
let us cut the problem down to size, but it clearly wouldn't work on
genomes, so it was of no use to Erich.
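(If "partitioning" is just a word to you: the core idea is that reads
which share k-mers belong together, and reads which don't can be
assembled separately. Here's a toy Python sketch of that idea -- this is
my illustration, not khmer's actual streaming implementation, and the
k-mer size is just an illustrative choice.)

    # Toy sketch of partitioning (illustration only): reads that share any
    # k-mer end up in the same partition, so each partition can be
    # assembled on its own.
    K = 21  # k-mer size (illustrative)

    def kmers(seq, k=K):
        return {seq[i:i + k] for i in range(len(seq) - k + 1)}

    def partition(reads):
        parent = list(range(len(reads)))        # union-find over read indices

        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]   # path halving
                i = parent[i]
            return i

        def union(i, j):
            parent[find(i)] = find(j)

        seen = {}                               # k-mer -> first read seen with it
        for idx, read in enumerate(reads):
            for km in kmers(read):
                if km in seen:
                    union(idx, seen[km])
                else:
                    seen[km] = idx

        groups = {}
        for idx, read in enumerate(reads):
            groups.setdefault(find(idx), []).append(read)
        return list(groups.values())

khmer does this against a probabilistic graph representation so it can
cope with metagenome-scale data; the toy version above keeps everything
in memory.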
As we worked, however, we found annoying levels of connectivity in our
metagenomic data -- highly connected regions of the graph that tied
everything together. Our efforts at removing this connectivity turned
out to be of great use (although they weren't the final kicker).
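("Removing this connectivity" deserves a sketch too. Below is a toy
illustration, not the khmer algorithm: call a k-mer a hub if it has
suspiciously many neighbors in the de Bruijn graph, and drop reads that
touch a hub. The k-mer size and degree cutoff are arbitrary choices of
mine, and real data would need a probabilistic table rather than a
Python set.)

    # Toy sketch of "remove highly connected regions" (illustration only):
    # flag k-mers with unusually many graph neighbors and drop any read
    # that touches one of them. Reverse complements are ignored for brevity.
    K = 21           # k-mer size (illustrative)
    MAX_DEGREE = 5   # a node here has at most 8 neighbors; above 5 we call it a hub

    def kmers(seq, k=K):
        return [seq[i:i + k] for i in range(len(seq) - k + 1)]

    def degree(kmer, present):
        """Count distinct neighbors (k-mers overlapping this one by k-1 bases)."""
        d = 0
        for base in "ACGT":
            if kmer[1:] + base in present:    # successors
                d += 1
            if base + kmer[:-1] in present:   # predecessors
                d += 1
        return d

    def drop_hub_reads(reads):
        present = set()
        for read in reads:
            present.update(kmers(read))
        hubs = {km for km in present if degree(km, present) > MAX_DEGREE}
        return [r for r in reads if not any(km in hubs for km in kmers(r))]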
When Erich and I discussed his assembly problems, we figured it had
to have something to do with repeats and polymorphism. If you have a
repeat-rich genome and a lot of polymorphism, you're gonna have a hard
time assembling much; plus, efforts to traverse those regions and
assemble them into contigs (or ignore them) were going to run into
combinatorial complexity. So when I said we had a way to remove
highly connected regions but that it was meant for metagenomes, Erich
asked me to run it on his data.
I said no. There was no reason to believe that it would work.
He asked again. I still said no.
He continued to insist. At some point, he must have bought me some
good steak or something. And then I decided to run our delumping code
just to get him off my back.
... Voila. Erich took the delumped data, fed it into Velvet, and went
to sleep. He woke up the next morning to a crash, or at least what he
assumed was a crash -- he'd never had an assembly complete in less
than a week on this data set. But, of course, it turned out to be an
actual, real, useful assembly! And that was the beginning of the end of
the basic assembly problem for him.
This delumping wasn't the final filtering stage actually used for the
assembly. My lab was in an intense period where we were throwing ideas
around a lot, and about 6 months later we figured out diginorm (digital
normalization), which turned out to be a much better tactic for all
sorts of things. (Although we're still very fond of partitioning. More
on that in a few weeks.) But that initial result was the breakthrough
in many ways, even though, even today, we have at best an imperfect
understanding of why it worked.
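(For anyone who hasn't met diginorm: the core loop is small enough to
sketch in a few lines of Python. The idea is to keep a read only if its
median k-mer abundance, measured against the reads already kept, is
still below a coverage cutoff. The k and cutoff below are just
illustrative defaults; the real implementation is khmer's
normalize-by-median.py, which uses a probabilistic counting table
instead of an exact counter.)

    # Toy sketch of digital normalization (illustration only): discard a read
    # if the data kept so far already covers its k-mers deeply enough.
    from collections import Counter
    from statistics import median

    K = 20        # k-mer size (illustrative)
    CUTOFF = 20   # target coverage (illustrative)

    def kmers(seq, k=K):
        return [seq[i:i + k] for i in range(len(seq) - k + 1)]

    def diginorm(reads, k=K, cutoff=CUTOFF):
        counts = Counter()
        kept = []
        for read in reads:
            kms = kmers(read, k)
            if not kms:
                continue
            if median(counts[km] for km in kms) < cutoff:
                counts.update(kms)        # accept the read; count its k-mers
                kept.append(read)
        return kept

Part of why it helped here, I think, is that it flattens crazy coverage
variation: over-amplified regions get trimmed down to the cutoff while
thin regions are left alone.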
What was wrong with the Haemonch data?
As we now know, the sequence not only came from many different
individuals but was also filled with microbial contaminants; moreover,
the genomes had been subjected to whole-genome amplification, which
results in dramatic coverage variation. So we had an almost perfect
quadrifecta: a eukaryote, steeped in a virtual metagenome, with high
polymorphism, and really nasty amplification bias. (At one point, I
estimated that the polymorphism rate was 10% -- one in ten nucleotides
was different between the two haplotypes of the diploid genome --
although I believe that number was probably inflated by a high
sequencing error rate. But still.) Digital normalization turned out
to solve the biggest of these problems, allowing an initial assembly; this
led to discovering the error problem and the metagenome problem, which
led to decontamination protocols; and then finally Erich built a whole
pipeline to post-process the data into a nice assembly. It was a heck
of a lot of work on his part, and I still feel marginally guilty about
freeriding into coauthorship (but only marginally ;).
And that, boys and girls, is how I ended up in the genome assembly
game, as opposed to staying in my neat little boxes of metagenome and
transcriptome assembly.
Since then, we've applied khmer-based techniques -- largely diginorm,
but some others as well -- to a variety of genomes, and discovered
that in many cases they work quite well in terms of producing an
initial assembly. I believe that at least 2-3 more papers will be
coming out soon from other groups, pointing out that diginorm performs
reasonably well on genome assembly.
It's always nice to actually be useful.
More than that, I think it's a good story about the, ahem, serendipity
that comes from casual scientific interactions. I was working on soil
metagenomes, Erich was working on nematodes; whoda thunk our technology
would help him out?
What's next, bub?
We may build some standard protocols and assessment mechanisms for
genome assembly into khmer. I know that some groups have made a habit
out of diginorming their genomic data into oblivion, and it would be
nice to codify these practices into something more generally usable.
We also think we can use diginorm-based techniques to do some clever
things with combined Illumina and PacBio data, so hopefully more on
that later.
And, while I'm still planning to stick mostly to my particular corner
(metagenomes for everyone! transcriptomes for everyone!), I may nip out
every now and then and poke around with a genome.