Our review of the eigengenome partitioning paper

Below is our first review of the paper "Detection of low-abundance bacterial strains in metagenomic datasets by eigengenome partitioning" by Brian Cleary et al., recently published in Nature Biotech.

All in all, this is one of the most technically interesting papers we've read in a while.

A few interesting tidbits of background info --

  • this is the paper that I talked about at the top of "Please destroy this software after publication. kthxbye." (The software appears to be more usable now than it was when we first reviewed the paper.)
  • after signing the review, I engaged in separate conversations about the paper with the two other reviewers, the editor, and the authors ;).
  • I would guess that the authors lost at least a dozen citations by not publishing this as a preprint. I consciously avoided mentioning the approach, using the software, or developing on top of the method while it was in review, because it wasn't publicly available yet and I didn't know how long it would take to come out. Maybe this isn't a big deal in the larger scheme of things, but I think it's a very clear example of science being slowed down by publishing delays.
  • one of the other reviewers suggested that they look to our own work on partitioning (Pell et al., 2012) and do a comparison. I think I sent an e-mail to the authors saying that this would be a silly comparison because their method should work way better than ours, and they could quote me on that ;). (I gave more details in the e-mail, but, basically, graph partitioning as-is doesn't work that well on large samples.)

In this paper, Cleary et al. devise a clever (dare I say "groundbreaking"?) approach to raw-read partitioning of metagenomic data. This approach addresses a variety of problems in shotgun metagenome analysis. In short, the approach is entirely novel, the scaling potential is amazing, the results seem quite good to us, and the practical need for such an approach is great.

We divide our review into three parts -- the results & discussion; the methods; and the software. Details below.

Given the novelty of the approach and the strength of the results, we personally don't think it's reasonable to require that the software be made more usable, but this may be a requirement of NBT. We do think a small test data set that can be run on a single machine should be made available for reviewers and readers to try out; this shouldn't be particularly burdensome or time consuming.

The authors should consider posting this as a preprint so we can cite this in forthcoming reviews while waiting for the no doubt lengthy review process to complete...

Revisions that should, in our view, be required, are marked with '**'.


Results and discussion:

The authors develop a partitioning approach that splits up raw reads into different bins, or partitions, based on an eigengenome clustering approach. The goal is to separate reads by source genome based on abundance covariation, so that they can be assembled or analyzed by downstream approaches. A key part of the approach is to do this partitioning scalably, so that extremely large data sets can be partitioned and then analyzed with more intensive approaches later on.

To validate this approach, the authors apply their partitioning approach to several large data sets and generate what, to those of us who have invested in raw-read partitioning, are excellent results with the normal drawbacks. Specifically, the partitioning seems to do a great job of separating out reads that should belong together, with the caveat that separating highly similar sequences (core conserved regions in strains) is effectively impossible.

The first paragraph is a general intro that's a little unsatisfying: first, metagenomic researchers frequently sequence billions of bases because otherwise _they don't sample the diversity_, not because of the challenges they face with assembly pipelines! And second, it's not at all clear that genomic content can be "identified through popular alignment-based metagenomic analysis." Sure, that's what's used, because they have no choice, but it doesn't work very well except for SSU analyses.

The rest of the introduction is very well done - technically sophisticated and pretty comprehensive. I'm a little leery of the strength of the statement that "In this scheme, short reads which originate on the same physical fragment of DNA are likely to partition together..." - isn't that difficult to conclusively demonstrate?? Maybe "should" would be better than "are likely"?

The data sets chosen are good test sets. The application to a multiple TB data set (!!) is amazing. The Sharon data set is a perfect application of this approach and the results look good to me.

I can't find a place where the authors say this outright, although they imply it in their discussion -- they should point out that MORE SAMPLES gives you MUCH MORE RESOLUTION with this approach, which is a key feature in the sample-heavy future.

The authors are (in my opinion) quite right that their approach is a dramatic improvement over assembling first and then binning later; this approach leaves open the option of applying different assemblers or parameters for each bin, which could be quite important, and is also likely to be much more scalable than any assembly-based approach.


Methods:

The methods are (or at least seem to be, to us) pretty groundbreaking, and applicable well beyond metagenomics. We do not think that using document vectors/latent semantic indexing for unassembled sequence analysis has been done before -- this is enabled by their use of the complex-simplex hashing function. We believe this to be novel (or at least we would be using it if we had ever seen it before).

The authors effectively construct a (sample x k-mer abundance) matrix that they then cluster. However, this on its own is not scalable, so they invest in a nice hyperplane clustering approach, relying on a complex-space hashing approach that places close k-mers near to each other in k-space. These k-mers are assigned to the same column in the matrix, which effectively decreases the size of the matrix to something much more manageable. This matrix is then clustered using SVD. Reads are then assigned back to these k-mer clusters.
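To make the column-collapsing step concrete, here is a minimal sketch of random-hyperplane hashing of k-mers, written by us for illustration and not taken from the authors' code. The base-to-complex encoding, the function names (make_hyperplanes, hash_kmer, abundance_matrix) and the parameter values are all assumptions; the only thing taken from the paper is the general idea that k-mers falling on the same side of every hyperplane share a matrix column.

```python
import random

# Assumed encoding (illustrative only): each base maps to a point on the
# complex unit circle, so a k-mer becomes a vector in k-dimensional complex space.
BASE_TO_COMPLEX = {"A": 1 + 0j, "C": 0 + 1j, "G": -1 + 0j, "T": 0 - 1j}


def make_hyperplanes(k, n_bits, seed=0):
    """Draw n_bits random complex normal vectors, one per hyperplane."""
    rng = random.Random(seed)
    return [
        [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(k)]
        for _ in range(n_bits)
    ]


def hash_kmer(kmer, hyperplanes):
    """Return a column index: one bit per hyperplane, set by which side the k-mer falls on."""
    vec = [BASE_TO_COMPLEX[base] for base in kmer]
    column = 0
    for plane in hyperplanes:
        # The sign of the real part of the Hermitian inner product picks the side.
        dot = sum(v * p.conjugate() for v, p in zip(vec, plane))
        column = (column << 1) | (1 if dot.real >= 0 else 0)
    return column


def abundance_matrix(samples, k, hyperplanes):
    """Build a (sample x hashed-column) count matrix from lists of reads."""
    n_cols = 2 ** len(hyperplanes)
    matrix = []
    for reads in samples:
        row = [0] * n_cols
        for read in reads:
            for i in range(len(read) - k + 1):
                row[hash_kmer(read[i:i + k], hyperplanes)] += 1
        matrix.append(row)
    return matrix


# Toy usage: two "samples" of made-up reads, 5-mers, 4 hyperplanes -> 16 columns.
samples = [["ACGTACGTAC", "GGGTTTACGA"], ["ACGTACGTAC"]]
planes = make_hyperplanes(k=5, n_bits=4)
matrix = abundance_matrix(samples, 5, planes)
print(len(matrix), len(matrix[0]))  # 2 rows (samples), 16 columns
```

The number of hyperplanes fixes the number of columns (2 to the power n_bits), which is the knob that trades resolution against memory. This sketch does not handle ambiguous bases such as N; a real pipeline would need a policy for them.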

For each sample, the number of columns must scale roughly with the diversity of the sample; this seems likely to be the major memory bottleneck.

It seems to me that the resolution could be increased iteratively, by some sort of adaptive splitting of columns in tandem with the SVD. This is probably part of their future work.

An important feature of the SVD implementation is that it is streaming!
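To illustrate why streaming matters here, the sketch below shows one simple way to get the leading singular value and left singular vector of a (sample x column) matrix without ever holding the full matrix in memory. It is our own illustration and not the authors' implementation: it accumulates the small sample-by-sample Gram matrix one chunk of columns at a time, then runs power iteration on it. The function names and the chunk format are hypothetical.

```python
import math


def accumulate_gram(column_chunks, n_samples):
    """Sum col * col^T over all columns, consuming one chunk at a time."""
    gram = [[0.0] * n_samples for _ in range(n_samples)]
    for chunk in column_chunks:  # each chunk is a list of columns
        for col in chunk:        # each column holds one abundance per sample
            for i in range(n_samples):
                for j in range(n_samples):
                    gram[i][j] += col[i] * col[j]
    return gram


def top_singular(gram, n_iter=200):
    """Power iteration on the Gram matrix: returns (singular value, left singular vector)."""
    n = len(gram)
    vec = [1.0 / math.sqrt(n)] * n
    for _ in range(n_iter):
        nxt = [sum(gram[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            return 0.0, vec
        vec = [x / norm for x in nxt]
    # Rayleigh quotient gives the top eigenvalue of the Gram matrix,
    # which is the square of the top singular value of the original matrix.
    eigenvalue = sum(
        vec[i] * sum(gram[i][j] * vec[j] for j in range(n)) for i in range(n)
    )
    return math.sqrt(max(eigenvalue, 0.0)), vec


def column_chunks():
    """Toy generator standing in for columns read from disk in pieces."""
    yield [[3.0, 0.0]]
    yield [[0.0, 1.0]]


gram = accumulate_gram(column_chunks(), n_samples=2)
sigma, left_vec = top_singular(gram)
print(sigma, left_vec)
```

This works because the number of samples is small compared with the number of columns, so the Gram matrix stays tiny however many columns stream past. It recovers only the leading component; further components would need deflation or a proper incremental SVD.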

We think the complex-simplex approach and clustering could easily be applied to many parts of sequence analysis, including mRNAseq analysis, error correction, and graph building.

In any case, this all makes sense to us and we think it's novel.


Software:

The software is available on GitHub, good!

** No license is specified; this needs to be addressed.

** No copyright is specified; this needs to be addressed.

** The specific version (git hash? branch tag?) used to generate the results in this paper should be specified somewhere and placed in the paper.

The software is clearly not intended for direct reuse. There are stern warnings about how it's going to be hard to use in any particular environment, and it's a mess o' scripts.

** Space limitations in NBT prevent replication details (detailed parameters, etc.) from being provided in the text. Fine, but please provide them somewhere (GitHub?) and point at them!

[ ... typo commentary redacted ... ]

Signed,

C. Titus Brown, MSU

Camille Scott, MSU
