A few weeks back, Nick Loman (via Manoj Samanta) brought MEGAHIT to
our attention on Twitter.
MEGAHIT promised "an ultra-fast single-node solution for large
and complex metagenome assembly" and they provided a preprint and some open source software. This is a topic near and dear
to my heart (see Pell et
al., 2012 and Howe et al.,
2014), so I was
immediately interested - especially since the paper used our Howe et
al. data set to prove out their results. (The twitterati also pointed
out that the preprint engaged in some bashing of this previous work,
presumably to egg me on. ;)
So I thought, heck! Let's take MEGAHIT out for a spin! So my postdoc
Sherine Awad and I tried it out.
tl;dr? MEGAHIT seems pretty awesome to me, although IDBA and SPAdes
still seem to beat it by a bit on the actual assembly results.
Installing MEGAHIT
We ran into some small compilation problems but got it working on an
Ubuntu 12.04 system easily enough.
Running it was also a snap. It took a few minutes to work through the
required options, and voila, we got it running and producing results.
(You can see some example command lines here.)
First question --
How does it do on E. coli?
One of the claims made in the paper is that this approach performs
well on low-coverage data. To evaluate this, I took a 5m read subset
from the E. coli MG1655 dataset (see Chitsaz et al., 2011) and further
subsetted it to 1m reads and 500k reads, to get (roughly) 100x, 20x,
and 10x data sets. I then ran MEGAHIT with default parameters,
specifying 1 GB of memory, and limiting only the upper k size used
(because otherwise it crashed) -- again, see the Makefile.
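As a rough sanity check on those coverage labels, average depth is just total sequenced bases divided by genome size. The sketch below assumes 100 bp reads (an assumption on my part; the read length isn't stated above) and uses the published MG1655 reference length; the function name is made up for illustration.

```python
def estimated_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Approximate average depth: total sequenced bases / genome size."""
    return n_reads * read_length / genome_size


GENOME_SIZE = 4_641_652  # E. coli K-12 MG1655 reference length, in bp
READ_LENGTH = 100        # ASSUMPTION: read length is not given in the post

for n_reads in (5_000_000, 1_000_000, 500_000):
    depth = estimated_coverage(n_reads, READ_LENGTH, GENOME_SIZE)
    print(f"{n_reads:>9,} reads -> ~{depth:.0f}x")
```

Under that assumption the three subsets land a little above 100x, 20x, and 10x, which is consistent with the "(roughly)" above.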
For comparison, I also ran SPAdes on the lowest-coverage data, looking
only at the contigs (not the scaffolds).
After it was all done assembling, I ran QUAST on the results.
| Measure | 100x | 20x | 10x | 10x (SPAdes) |
| --- | --- | --- | --- | --- |
| N50 | 73736 | 52352 | 9067 | 18124 |
| Largest alignment | 221kb | 177kb | 31kb | 62kb |
| bp in contigs > 1kb | 4.5mb | 4.5mb | 4.5mb | 4.5mb |
| Genome fraction | 98.0% | 98.0% | 97.4% | 97.9% |
| Misassembled length | 2kb | 40.8kb | 81.3kb | 63.6kb |
(Data: MEGAHIT 100x, 20x, and 10x; and
SPAdes 10x.)
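If N50 is unfamiliar: it's the contig length at which contigs that long or longer add up to at least half of the assembled bases. Here's a minimal sketch of that definition. It is not QUAST's code; the `min_length` parameter and the example lengths are made up for illustration (the 500 default just mirrors QUAST's default contig cutoff).

```python
def n50(contig_lengths, min_length=500):
    """Length L such that contigs of length >= L hold at least half the bases.

    Contigs shorter than min_length are ignored before computing the total.
    """
    lengths = sorted((n for n in contig_lengths if n >= min_length), reverse=True)
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        if 2 * running >= total:
            return length
    return 0  # no contigs passed the length filter


# Made-up contig lengths, purely for illustration.
example = [9000, 7000, 4000, 2000, 1000, 300]
print(n50(example))
```

One consequence worth keeping in mind when reading the table: a few long contigs can pull N50 up a lot without the genome fraction changing much.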
In summary, it does pretty well - with even pretty low coverage,
you're getting 97.4% of the genome in contigs > 500bp (QUAST's default
cutoff). Misassemblies grow significantly at low coverage, but you're
still only at 2% in misassembled contigs.
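The 2% figure can be checked straight from the table. This little sketch divides each misassembled length by the (rounded) 4.5mb assembled in each case; using that rounded total as the denominator is my assumption, so treat the results as approximate.

```python
# Values copied from the QUAST table above, in kb.
misassembled_kb = {
    "100x": 2.0,
    "20x": 40.8,
    "10x": 81.3,
    "10x (SPAdes)": 63.6,
}
ASSEMBLED_KB = 4500.0  # ASSUMPTION: the rounded "4.5mb" total, same for all columns


def misassembled_percent(mis_kb: float, total_kb: float) -> float:
    """Share of assembled bases sitting in misassembled contigs, in percent."""
    return 100.0 * mis_kb / total_kb


percentages = {
    name: misassembled_percent(kb, ASSEMBLED_KB)
    for name, kb in misassembled_kb.items()
}
for name, pct in percentages.items():
    print(f"{name:>13}: {pct:.2f}%")
```

The 10x MEGAHIT assembly comes out just under 2%, and the 100x assembly is well under 0.1%.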
In comparison to SPAdes at low coverage, the results are OK
too. SPAdes does at least as well in every category, which I would expect
-- it's a great assembler! -- but MEGAHIT performs well enough to be
usable. MEGAHIT is also much, much faster - seconds vs minutes.
Next question --
How fast and memory efficient was MEGAHIT?
Very. We didn't actually measure it, but, like, really fast. And low
memory, also. We're doing systematic benchmarking on this front for
our own paper, and we'll provide details as we get them.
(We didn't measure MEGAHIT's performance because we don't have numbers
for SPAdes and IDBA yet. We didn't measure SPAdes and IDBA yet
because actually doing the benchmarking well is really painful - they
take a long time to run. 'nuff said :)
So, what are your conclusions?
So far, +1. Highly recommended to people who are good at command line
stuff and general all-around UNIX-y folk. I'd want to play around
with it a bit more before strongly recommending it to anyone who
wasn't a seasoned bioinformatician. It's rough around the edges, and
I haven't looked at the code much yet. It also breaks in various edge
cases, but at least it's pretty robust when you just hand it a straight
up FASTQ file!
That having been said, it works shockingly well and is quite fast and
memory efficient. If you're having trouble achieving an assembly any
other way I would definitely recommend investing the time to try out
MEGAHIT.
--titus
p.p.s. Thanks to Rayan Chikhi and Lex Nederbragt for reading and commenting on
a draft version of this post!
Appendix: MEGAHIT and digital normalization
In the MEGAHIT paper, they commented that they believed that digital
normalization could lead to loss of information. So I thought I'd
compare MEGAHIT on 100x against MEGAHIT and SPAdes running on
digitally normalized 100x:
| Measure | 100x | DN (w/MEGAHIT) | DN (w/SPAdes) |
| --- | --- | --- | --- |
| N50 | 73736 | 82753 | 132872 |
| Largest alignment | 221kb | 222kb | 224kb |
| bp in contigs > 1kb | 4.5mb | 4.5mb | 4.6mb |
| Genome fraction | 98.0% | 98.1% | 98.2% |
| Misassembled length | 2kb | 120kb | 48kb |
(Data: MEGAHIT 100x,
MEGAHIT DN, and
SPAdes DN.)
The short version is, I don't see any evidence that diginorm leads to
incompleteness, but clearly diginorm leads to lots of misassemblies
when used in conjunction with MEGAHIT or SPAdes on high-coverage
genomes. (We have some (ok, lots of) evidence that this doesn't
happen with lower coverage genomes, or metagenomes.) That having been
said, it's clearly rather assembler-specific, since SPAdes does
a much better job than MEGAHIT on DN data.
The shorter version? You probably won't need to use diginorm with
MEGAHIT, and you shouldn't. That's OK. (There are lots of reasons
why you shouldn't use diginorm.)
I still don't have any evidence that diginorm drops information in
non-polyploid situations. Let me know if you've seen this happen!
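For anyone who hasn't met digital normalization before, the core idea fits in a few lines: walk through the reads once, estimate each read's coverage as the median count of its k-mers among the reads kept so far, and drop the read if that estimate has already reached a cutoff. The sketch below is a toy version under loud assumptions: it counts k-mers exactly in a dict rather than in a memory-efficient probabilistic structure, it ignores reverse complements, and the function names, `k`, and `cutoff` values are illustrative only.

```python
from statistics import median


def kmers(seq: str, k: int) -> list:
    """All overlapping substrings of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def normalize_by_median(reads, k=20, cutoff=20):
    """Keep a read only if its median k-mer count so far is below cutoff.

    Toy version: exact counting, no reverse-complement handling.
    """
    counts = {}
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read is shorter than k, nothing to estimate
        if median(counts.get(km, 0) for km in read_kmers) < cutoff:
            kept.append(read)
            # Only reads we keep contribute to the k-mer counts.
            for km in read_kmers:
                counts[km] = counts.get(km, 0) + 1
    return kept


# Ten copies of one made-up read plus one distinct made-up read.
reads = ["ACGTTGCA"] * 10 + ["GGGGCCCCAAAA"]
kept = normalize_by_median(reads, k=4, cutoff=3)
print(len(reads), "reads in,", len(kept), "reads kept")
```

Redundant reads get thrown away once their region is already covered to the cutoff, while reads from regions not yet seen are kept, which is why the completeness numbers above barely move.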
Appendix II: Running your own evaluation
All of the E. coli numbers above are available in the
2014-megahit-evaluation github repo. See README.md
in that repo for basic install instructions, and Makefile
for what I ran and how to run it. Feel free to reproduce, extend, and
update!