The story behind the spacegraphcats project and paper

We recently posted a preprint entitled "Exploring neighborhoods in large metagenome assembly graphs reveals hidden sequence diversity" (biorxiv link). This work is the first but hopefully by no means the last product of the spacegraphcats project, a collaboration between my lab, Blair Sullivan's lab at NC State, and Jeff Heer's lab at UW Seattle. It is the first fruit of two and a half years of intense work by one of the most interdisciplinary teams I've ever had the pleasure of working with, and I want to share some of the story!

The background

In January and February of 2016, I read two papers that had a tremendous impact on my thinking - the MetaPalette paper and the mash paper (see my blog post on k-mers and taxonomy for more background here). I immediately jumped on these concepts, reimplementing mash in Python and started playing around with it - this was the start of sourmash.

I then spent a few months playing around with applying MinHash to De Bruijn graphs of metagenomes. Since MinHash works with k-mers, and De Bruijn graphs are just collections of k-mers, I thought, hey, maybe we can do something cool with MinHash and De Bruijn graph structure. So I spent a lot of time hacking on khmer and sourmash and playing with various concepts.

(I still think the ideas I generated during this phase are really promising, but after two years I have many, many, many detailed reasons as to why most of them are naive and not only don't work but can't work. Science sucks, man. But it did lead in several excellent directions so the work is not wasted.)

The conclusion I reached from all this work was that we needed a good way to select k-mers from neighborhoods within a graph. Collections of shotgun reads were not so useful for this - close bits of the graph are not local within the reads. We'd done some work on partitioning graphs as part of Pell et al., 2014 but I knew that partitioning wouldn't work here. (Turns out assembly graphs are generally completely connected, for Reasons.)

Enter the Barnraising for Data-Intensive Discovery

In May 2016, I joined about 20 people at an event organized by Blair Sullivan and Casey Greene, entitled the Barnraising for Data-Intensive Discovery. The idea was to gather a bunch of researchers in an isolated location for a week, define some projects of mutual interest, and hack on them together in a spirit of mutual assistance - hence "Barnraising". The invitees included anyone funded by the Moore Data Driven Discovery program, and we ended up with a nice mixture of computer scientists, biologists, mathematicians, and other researchers.

The Barnraising was held at the Mount Desert Island Biological Laboratory in Maine, and while the weather was absolutely miserable (cold and rainy!) the environment inside was pretty great! At the opening meeting I pitched my "hey I need to cut graphs up into neighborhoods" idea and it intrigued several people, including Blair and two people from her lab, Felix Reidl and Michael O'Brien, a postdoc and grad student respectively. We also recruited Dominik Moritz, a graduate student from Jeff Heer's lab at UW (Jeff was also an Investigator, and the UW eScience Institute was a Moore DDD Data Science Environment awardee).

(These four researchers ended up being four of the other five authors on this paper!)

My initial (bad) idea, and the origin of the name "spacegraphcats"

The idea I'd pitched to the larger group was that metagenome assembly graphs should, by all rights, consist of neighborhoods of more densely connected k-mers (representing individual species) that were sparsely connected to other (unrelated) k-mers; and that this sparse graph could, potentially, be cut into species neighborhoods. I was interested in looking directly at the assembly graph structure to make these cuts, without going through the filter of outputting contigs (which do cut the graph, but... not in a way that I trust :).

During the course of the project, the ideas got significantly modified as it became clear that I didn't know that much about graphs in general, or metagenome graphs, and that my intuition was (in places) flatly wrong.

Now, you may wonder why the software mentioned in the paper is called "spacegraphcats". Well, if you pass "sparse graph cuts" through the mind of an idle German mathematician (coff FELIX coff) then said mathematician may come up with the name "spacegraphcats". And of course when looking for logos and related memes online, well, "space" and "cats" make things easy. And thus "spacegraphcats" was born.

Despite the changes in the project focus, the name stayed at "spacegraphcats", 'cause it was cool... and why not?

What did we actually end up doing?

The whole project has a pretty broad scope, but for this first paper, we ended up focusing in on a fairly targeted question: can we improve the completeness of metagenome-assembled genomes, or MAGs?

Briefly, MAGs are genomes computationally extracted from metagenomes. The usual process of inferring MAGs is to assemble a metagenomic sample, then group contigs together into putative genomes based on sequence composition, estimated contig abundance, and phylogenetic marker genes. The completeness of these "genome bins" could then be estimated by the fraction of single copy genes thought to be essential for life.

The process seems to work pretty well, and MAGs have become incredibly popular, with well over 10,000 MAGs extracted in the last few years (see the first paragraph of the spacegraphcats paper for references).

But I suspected that there was more to be done. There are several shortcomings of MAGs and the binning progress: the main one (for me) is that generally the binning depends on contigs that come out of DNA sequence assembly, which is a challenging and fragile process. I believed that metagenomes probably contained more sequence and more sequence variation than was represented in the MAGs, and I thought we were probably underestimating the content of MAGs by ignoring accessory genome content. But, while I have reasons for these opinions, I don't (didn't) have specifics - it is surprisingly difficult to nail down missing content in complex communities!

So I was really, really interested in taking a look at what was in the graph neighborhood of these bins. And after some significant rephrasing of the original sparse graph cuts question by Felix, Mike, Dominik, and Blair, we realized that a graph neighborhood search function would be directly useful. The idea would be to reach into the assembly graph with a set of query k-mers (or reads) and extract the full set of k-mers proximal (in the graph) to those queries.

What this led to was an implementation of an r-dominating set algorithm, where a subset of nodes in the compact De Bruijn graph are chosen such that every node in the cDBG is within distance r of one of these chosen nodes. This forms what is known as a dominating set, and it provides a kind of clustering over the entire graph. These dominating nodes could then be used to define a local neighborhood: we could go from a k-mer to a cDBG node to its dominator to the dominated nodes, and retrieve everything that belonged to a dominated nodes as part of that neighborhood.

Now, defining an r dominating set is NP-hard on large graphs, but conveniently, Blair's group was interested in and knowledgeable about efficiently calculating certain properties for graphs of bounded expansion. Graphs of bounded expansion permit certain optimizations of otherwise difficult algorithms, and (mirabilis visu) compact De Bruijn graphs turn out to be just this kind of graph, because each node has at most 8 edges.

In the end, we were able to calculate r-dominating sets in linear time for cDBGs because of this property. (Again, see the paper for details.)

Implementation, implementation, implementation ...and real data.

We actually walked out of the barnraising with a functioning dominating set calculator that we could apply to real data. But we had to build a lot of code infrastructure around this, including -

code to build a cDBG (since thankfully replaced with BCALM);
graph indexing that let us go from a k-mer to a cDBG node (we ended up using the bbhash minimal perfect hash function for this);
graph indexing that let us go from a cDBG node to the set of reads that had contributed k-mers to that node;

...all of which now resides in the spacegraphcats codebase.

The process of building the spacegraphcats program was heavily iterative. I think we went through at least two rewrites of the dominating set algorithm, which is now the least computationally intensive part of the whole package. I spent more than 18 months trying to get MinHash based search to work, only to essentially prove that due to the rampant strain variation in real data sets, MinHash based methods were too insensitive to find graph regions at the necessary resolution. (More on this at some other time.) Overall, I think we wrote 5-10x as much code as we ended up using in the final paper.

So what did we find?

The two key biological findings of the paper, in my mind, are these:

1) The graph neighborhoods of metagenome-assembled genomes contain real content that almost certainly belongs in the bin, but that was not included in the bin.

2) One of the reasons for this content to be missing is that there is a lot of strain variation present in the neighborhood, and this prevents assembly (and hence binning). It's hard to estimate the magnitude of this effect but it's at least 10-20% in the data set we chose to benchmark with.

We do lots of other stuff in the paper, and it's chock full of interesting ideas, but those are the two main takeaways for biologists.

(If you're a computer scientist, you should be salivating at the dominating set algorithm and implementation. But this blog post is for biologists :)

What is particularly neat about the paper?

I think the underlying spacegraphcats algorithm, code, and way of thinking is powerful and will be enabling in all sorts of unexpected ways. But of course I'm not yet sure which parts are going to actually be useful to others. We chose the binning work for this paper because it was the part we could most concretely discuss; it is definitely not the most interesting thing about the spacegraphcats project overall.

Probably the coolest unexpected thing about this paper was how well the Plass amino acid assembler worked on the neighborhoods. And therein lies a bit of a story --

You see, we got to the point where everything was working in ~February or March of 2018. We could do the genome queries, we could get the reads from the neighborhood, and... we were kinda stuck there. If (as we suspected) the reads didn't assemble well due to strain variation, then what do you do with them? We were thinking of running different nucleotide assemblers, but it is rare that one assembler unambiguously outperforms another assembler. And there wasn't much we could do without an assembler here.

This point bears repeated emphasis. The strong underlying hypothesis was that one of the major reasons sequence wasn't being binned was because it wasn't assembling. But you can't really do all that much with 100 bp reads other than try to assemble them! And (as we suspected) the nucleotide assemblers didn't perform all that well on the reads, when we measured it. What could we do??

Assembling in amino acid space

So we started hunting around for ways to look at the reads that would not involve running them through a nucleotide assembler. I'd recently seen the Plass paper, which talked about assembling in amino acid space, and so I decided to give it a try; in theory, at least, amino acids should be less affected by nucleotide strain variation, because all of the synonymous codons would be collapsed.

Somewhat to my surprise, the Plass assembler worked wonderfully well out of the box, and - in at least a few cases - dramatically increased the estimated completeness of the bins. When we looked, we confirmed that Plass was (as hoped) assembling many more protein sequences. And we also saw a surprising amount of amino acid variability in what was being assembled.

But what did it mean? WHAT DID IT MEAN??

Enter the final author!

Here we asked Taylor Reiter to dig in. Taylor is a graduate student in my lab who joined the project to help with the biological data analysis. She'd been working on assessing the functional content of the bins since March 2018 or so, and when I threw the Plass assemblies her way she tore into them.

What she discovered was that even with ridiculously stringent error trimming on the underlying data (a hard k-mer trim of abundance=5 on all the reads prior to assembly), we were seeing many minor variants in really highly conserved genes like gyrA. I think we still don't have a handle on the number and importance of these variants, but at the nucleotide level there seem to be dozens of variants. And, at least in a few neighborhoods, there seem to be two or more high abundance amino acid contigs, suggesting two or more significant strains.

And that's where we left the paper.

The paper concludes with the observation that there's a lot of unseen variability in metagenomes, and that they are unseen in part because of challenges with the tools that we use. It's not clear how much biological impact this might have on MAG analysis, e.g. on attempts to reconstruct metabolic potential, but indications are that it has at least some. We're thinking about how to follow up on this concretely.

In the meantime, with this first paper, we have produced a reasonably robust tool, based on a strong and efficient algorithm, that lets us systematically look at metagenomes in a way that we basically haven't been able to do before. I'm pretty stoked about this, in part because (as has been my experience with other papers like this) I suspect that there will be many interesting computational extensions that can be built on this foundation, by people that are smarter than me about assembly graphs. I'm looking forward to it!

--titus

There are comments.

social