Sun, 29 Aug 2010
Assembly is hard because it's not decomposable
(with Adina Howe, Jason Pell, Rosangela Canino-Koning, and Arend Hintze).
Introduction
A few weeks ago I blogged a bit about a k-mer filtering system, khmer, that we were using to reduce metagenomic data to a more tractable size by throwing out error-prone reads (see A memory efficient way to remote low-abundance k-mers from large DNA data sets). No sooner had we tried that, than did we realize that we were probably primarily throwing away good, if low-abundance data (see Illumina reads and their features). No matter: we couldn't assemble the original data sets anyway, so we had to get rid of some of it, right?
The subject of this blog post is not on how to best throw away data. (I'll address that in a few weeks.) Instead, it's on why we have to throw away data in the first place. More precisely,
Why is assembly hard?
First, some background. Imagine you have some long-ish strings (1mn - 200 mn in length), composed of only the letters A, C, G, and T, and you want to know what the sequence of the strings is. You can't actually read the sequences directly; they're too physically small. But you can randomly retrieve short subsequences ~100-1000 letters in length from the original long sequences. You don't know where they're from on the original sequence, or even which of the original sequences they're from. And the process of retrieval is error-prone, so you can't even trust the exact sequence you get. But you do know that, by and large, the short sequences are mostly correct; and (the most important bit) that you can get as many of these short sequences as you want, within $$ limitations.
From this kind of information you want to reconstruct the original sequences.
This is a basic description of the process of shotgun sequencing, in which you take DNA, shred it, and then sequence from it randomly -- many, many times. And it lays out the basic problem of assembly, too: you want to figure out how to reconstruct the original sequences from the little subsequences that you actually have.
If you are a computer scientist, you can probably already think of some basic ways to proceed. For example, you could do an all-by-all comparison of the short sequences, lay out which ones overlap and how, build a map of the overlaps, and try to build a tiling path that maximizes the connectivity of your map. Voila! Some approximation of the original sequences results! This approach is known as the overlap-layout-consensus approach, where at the end you produce a consensus view of the original sequence based on all the reads you have.
If you are a computer scientist or someone who programs for a living, you will also immediately recognize this as a rilly rilly hard problem! Forget biological peccadilloes; just doing this efficiently for large collections of sequences is computationally quite difficult. In particular, the all-by-all comparison is brutal: the number of comparisons scales as N**2 with the number of sequences N, so even if it's relatively efficient to compare two sequences, the problem behaves poorly as your data set grows. Plus, building a map of the overlaps is another hard problem: holding all that information in memory requires (yep!) O(N**2) memory, which is not cheap.
Is there any easy way to break down the problem? After all, big computers aren't cheap, but small computers are; so if you could split the problem into many smaller chunks, you could imagine using a grid or Beowulf approach, and just buying lots and lots of cheap hardware to scale.
Alas, the problem isn't easy to subdivide. It's easy to see why, if you think about the nature of the original sequences. Here's a little diagram; suppose, for example, that you have four subsequences all derived from one original sequence:
(orig) atggaccagatgagagcatgagccatggacggatcatggaaaacggttaaaaggggcatgg (1) atggaccagatgagagca (2) gagcatgagccatggacggatc (3) ggatcatggaaaacggttaaaa (4) ttaaaaggggcatgg
If the layout above is the only way that subsequences 1-4 overlap and can assemble, then to decompose the overlap problem across multiple computers would involve sending (1) and (2) to one computer, and (3) and (4) to another, assembling them there, and then taking the results and composing them on a shared node. Unfortunately, to do this efficiently currently requires that you know that 1 and 2 overlap, and that 3 and 4 overlap -- which is basically the problem that you already need to solve!
As I understand it -- I'm not a computer scientist unless you look at my letterhead -- there is simply no efficient way to decompose the overlap-layout-consensus assembly algorithm without either assuming something about the structure of the data, and/or introducing errors. (If you disagree, I'd appreciate either a reference or an implementation; thanks ;)
The second, or possibly third, generation of assemblers
OK, but computer scientists and computational biologists aren't dumb, and they like to tackle hard problems, and frankly this is an incredibly important problem to solve (for all sorts of reasons that you'll have to trust me on for now). Moreover, N^2 scaling is simply unacceptable!
Newer assemblers use a de Bruijn graph approach. Essentially, this involves breaking the subsequences down into fixed-length words of length k, and constructing an overlap graph. For example, taking the sequences above,:
(orig) atggaccagatgagagcatgagccatggacggatcatggaaaacggttaaaaggggcatgg (1) atggaccagatgagagca (2) gagcatgagccatggacggatc (3) ggatcatggaaaacggttaaaa (4) ttaaaaggggcatgg
you would break the original sequences down into words of length (say) 5, yielding:
atgga gatga catga atgga atcat aaacg aaagg
tggac atgag atgag tggac tcatg aacgg aaggg
ggacc tgaga tgagc ggacg catgg acggt agggg
gacca gagag gagcc gacgg atgga cggtt ggggc
accag agagc agcca acgga tggaa ggtta gggca
ccaga gagca gccat cggat ggaaa gttaa ggcat
cagat agcat ccatg ggatc gaaaa ttaaa gcatg
agatg gcatg catgg gatca aaaac taaaa catgg
aaaag
The overlaps between k-mers now implicitly give you a graph connecting each k-mer to all overlapping k-mers; and if you can find a path that traverses every node in this graph once, you will have your original contig.
Note that this actually works, although of course k must be much bigger than 5 in practice, and there are all sorts of cute tricks you must play to do a good job of disentangling complicated graphs.
Why is this an advantage over the overlap/layout/consensus approach that we looked at first? I'm not sure I've identified all the reasons, but there are at least two very important ones.
First, memory usage. While your memory usage for finding overlaps grows > O(N) with the overlap approach (with sparse matrices it should be N log N, I think?), the de Bruijn graph approach consumes only as much memory as you need to represent each new k-mer (so, with the number of novel k-mers) as well as the connections between them (which can be implicitly represented if you have efficient k-mer lookup). For large, deeply sequenced data sets this is going to be a huge savings: there are only three billion bases in the human genome, and probably only two billion unique k-mers of length 32 -- so if you can store k-mers efficiently (hint: you can) then the de Bruijn graph approach is really great.
Second, k-mers and k-mer overlaps can be stored and queried efficiently -- you just use a hash table or a trie structure. For example, you can store all 4**17 k-mers of length 17 as 34-bit offsets in a hash table (2 bits per DNA base), or you can use a branching trie structure to store arbitrarily long k-mers (see tallymer). Hash tables are be efficient (if big) representations for densely occupied k-mer spaces, while tries will be efficient for sparsely occupied k-mer spaces. Arbitrary length sequences are comparatively difficult to store and query.
The de Bruijn graph approach is what Velvet, ABySS, and SOAPdenovo use, and it seems to work well.
So what's the problem?
Scaling. Scaling is the problem.
Well, that and the sequencing companies and the biologists.
Let me explain. Sequencing companies are producing newer and bigger and better machines, that produce more and more sequence, every week. The Illumina GA2 produces 10-100 Gb of sequence per run now. The HiSeq 2000 is going to produce even more enormous amounts of sequence as soon as we get one. And more, lots more, is on the way.
This wouldn't be a problem if biologists would just stick to the exciting old problems, like resequencing humans and doing transcriptomes etc. But noooo, biologists see these juicy new sequencers and think -- hey! I could sequence populations of organisms! Or, like, 30 new organisms at once! Or 30 transcriptomes at once! And it will be cheap! (And we'll have someone else do the bioinformatics, which is easy, right? Right?)
So the sequencing companies are producing newer and cheaper and faster sequencing machines, and the biologists are using them to tackle ever more exciting and novel and challenging biological questions, and ... guess what? Our existing tools and approaches don't scale very well.
For one very specific example, the de Bruijn graph approach breaks down completely if you are sequencing endlessly diverse populations, as we seem to be doing in metagenomics. If you have some high abundance organisms, and a lot more low abundance organisms, and you sequence the organism soup to some arbitrary level, the novel k-mers will swamp your assembler, and to no end -- because those k-mers are never going to assemble to anything big without more sequencing. In which case you've compounded your swamping problem in an attempt to solve your earlier problem.
Similar things happen with wild population sequencing, where you get new and diverse sequences every time you look at a new animal; humans, even with their relatively low diversity, are one fine example.
OK, so this is the problem to solve, and I think it's a really big problem. It's not decomposable so it can't be made to scale well, and we're already at the limit of our existing compute infrastructure for the data we already have. (See Terabase metagenomics -- the computational side and grim future for sequencing centers.) And as we try to inch the boundaries along, the sequencing companies are producing new and bigger machines to give us new and bigger amounts of data.
Are there any solutions? No really good ones, unfortunately. The solution du jour (see MetaHIT methods and my earlier blog posts) is to throw away low abundance data that you figure won't assemble, and/or subdivide the sequences by abundance, in the expectation that similar abundance sequences will come from the same original genome. These are basically approximation heuristics, hoping to reduce the data in such a way that the assembler can deal with it. The hope is that the assembler can do a not-terribly-bad-job of assembly based on the known structure of the population.
Moreover, the throwing-away-data solution won't scale very well; soon enough you'll be throwing away not just 90% of the data, but 99% of the data, just to get a tractable data set.
We are doomed, doomed I say! Clearly we should give up.
Anyway, this concludes part one of a series of blog posts on assembly. In part two, I plan to talk a bit about paired-end sequencing and repeat sequences.
--titus
p.s. An excellent assembly algorithm reference: Miller, Koren, and Sutton, Genomics, 2010.
posted at: 15:07 | path: /aug-10 | 4 comments
Sun, 25 Jul 2010
Terabase metagenomics -- the computational side
The Terabase Metagenomics meeting was good fun, but I most valued the computational component (because that's what I do). Rachel Mackelprang and Rob Knight and I wrote down a list of the computational issues involved in a petabase metagenomics project, and that list will help direct my future research. I'll try to write up the list sooner rather than later, but in the meantime, I wanted to write down some general thoughts on the subject of processing large amounts of data for metagenomics.
In any petabase metagenomics sequencing project focused on probing the unknown, we are going to have to face two basic computational issues: assembly and sequence comparison. Both of these issues require an all-by-all analysis, or at least the approximation of one. That is, as our sampling grows, we're going to want to assemble contiguous DNA sequences from larger and larger pools of short sequences; this requires some form of comparison that will scale as the amount of genomic novelty increases. We're also going to want to take the products of sequencing and assembly (raw reads and contigs) and compare them or derivative products to all known genes. By and large, this problem boils down to systematic comparisons of points in a sparsely but maybe thoroughly occupied n-dimensional space -- an N-squared problem that has no exact parallelizable solution.
Currently, neither assembly nor gene comparison scales well. Metagenomic assembly is currently the most disastrous: existing short-read assemblers use De Bruijn graphs and require ginormous amounts of memory (half-terabyte amounts or more) to run. Gene comparison is largely done using custom tools (for e.g. rRNA comparison) or the more general BLAST and HMMER/profile tools; these tools scale reasonably well for comparisons, but for post-processing into high-quality phylogenies and gene families, there is no scalable solution that I know of.
I firmly believe that the intractable-looking problem of assembly is going to be among the first to fall. This is because, at least in theory, metagenomic assembly should be massively parallelizable: metagenomes contain many distinct genomes that don't connect to each other, so if you can filter or bin the raw sequences into assemblable subsets, then you will be able to run the individual assemblies on as many computers as you have original contigs. This is a logical and simple approach that simply awaits appropriate heuristic techniques for division of the reads -- and we'll figure that out sooner rather than later. Plus, people much smarter than me are actively working on better ways of parallelizing the basic De Bruijn graph approach, and that will inevitably bear fruit.
The marker-gene comparison problem may already be solved. For any given gene family (like 16s) you can already do linear-time comparisons with profile methods like HMMER. Using existing breakdowns of phase space as a scaffold for new marker data as it arrives should be possible. At least, I can think of several ways that it will work, and Rob Knight is quite optimistic. (This is his area of expertise, and since he's a Python programmer, I'm inclined to give him the benefit of the doubt. ;)
The big problem facing in the future, I think, is the clustering and discovery of gene families. As we sequence deeper into the unknown, we're going to have to do ever larger, more systematic, and more sensitive comparisons of new data against all known genes, in all types of ways -- DNA and protein sequence, RNA secondary structure, and protein structure. The goal will be to elucidate gene families (which is an ill-defined concept), their phylogenetic structure, and as much as possible about their function, across all of the domains of life. This is hard because we are explicitly looking for trace connections -- distant homologies and similarities -- that we haven't seen yet. Unlike the marker-gene approach, we don't have good heuristics for reducing the problem to one of interpolation, and this means that the problem remains (at first glance) an O(N^2) problem. Even if you use a prefiltering approach (like BLAST or HMMER 3) you are still supra-linear, which is deadly when you're talking about billions or trillions of sequences.
Memory usage is particularly problematic for gene comparison, with most clustering techniques requiring scads of memory (N^2, in the extreme) for comparison problems. This may be something we can reduce with heuristic approaches, but biology has a way of messing with us so we can't rely on heuristics up front - especially in large discovery-based projects.
There's an additional complication, though: we really want to reduce the problem to one of online processing. As we gather petascale data, we don't want to do all-by-all comparisons of the new data against the old data; rather, we want to fit as much of the new data into an existing structure of classifications, and then sideline the remaining data for more complex exploration. This results in an O(N) classification for the "easy" or known stuff, while dealing with the hard stuff remains Bad (supra-linear, and perhaps exponential). For example, with respect to assembly, we expect to have assemblies of most of the more abundant soil genomes within the next 2-3 years; we should be able to pre-screen new data by seeing if it matches to an existing assembly, and if it does, simply count it (or discard it). Only the data that matches to no prior assembly would be kept. A similar approach can be used for 16s or marker-gene data. I can think of ways to approach it for protein-coding genes and non-coding RNA as well, but I'm very afraid of missing some or many of the unknown unknowns in our data sets by being too liberal with classification. On the flip side, being too conservative will inevitably lead to scaling problems, so there's no way to win.
As with all Big Data problems, the sheer scale of these issues causes all sorts of annoying data processing and storage problems. It's become a ruling principle of my tool development that algorithms must be constant memory: hence things like our k-mer filtering approach, which trades accuracy guarantees for pragmatic issues like the ability to run the software in the first place.
It's worth noting that Moore's Law will not save us. Neither memory nor CPU power is growing as fast as sequencing capacity -- some estimate the exponent on sequencing capacity as being > 5. (Thank goodness, disk space will probably keep up!) We're going to get a huge increase in the number of cores, but that doesn't really help with Big Data unless you have robust pleasantly parallel algorithms that don't require much memory -- which we don't, and in general probably won't, have.
So, my (our) outlook for the next few years will be: build tools that scale way better than our existing tools, in all directions (time, memory, parallelizability); yield guaranteed behavior with respect to memory consumption in particular; and are optimally sensitive and specific for the large class of unknowns (new proteins, ncRNA, and structures) that await us in the world of sequence.
Sounds like fun!
By the way, for those of you who like to think about problem equivalence, the protein comparison problem ultimately boils down to comparing protein structures. We, umm, can't solve that one, either...
--titus
posted at: 07:37 | path: /jul-10 | 0 comments
Sat, 24 Jul 2010
Terabase metagenomics - meeting and outlook
I'm on my way back from the Terabase Metagenomics meeting in Snowbird, UT, and I'm buzzing with ideas about how to move forward in metagenomics and bioinformatics research. Metagenomics, the use of genomics approaches to study microbial communities, has been opening up as sequencing drops in price. With sequencing becoming ever more ridiculously deep and inexpensive -- we're looking at a power of 2-5 increase in depth every year for the next 5 years -- the questions have increasingly become intertwined with (surprise!) computational science problems and specifically bioinformatics. As we move towards multiple individual terabase (and collectively petabase) sequencing data sets in the next two years, life is going to become more complicated for metagenomicists and the bioinformaticians who aspire to help them. (That's me.)
The meeting was focused on the promise of terabase metagenomics for a "whole earth" microbial ecology sampling projects, and we spent quite a lot of time trying to figure out the size of unknown unknowns in metagenomics. There was a great mix of people at the meeting -- both "pure" microbial ecologists who regarded computers with suspicion and computer scientists who didn't understand how DNA became protein, as well as the entire range between those two poles. We talked about questions such as: "how deep do we need to sequence in order to get a real picture of microbial diversity?" (a: very) and "how do we reconcile genomic and transcriptomic data sets?" (a: unknown) and "how can we scale up 1000-fold when our existing algorithms are all N-squared?" (a: with difficulty.) The discussion was at a reasonably high level of sophistication: pretty much everyone in the room had "seen the elephant", and representatives of the major analysis pipelines were also present.
We didn't come to any firm conclusions about the biology, but we did sketch the outlines of a global sampling project to look at global microbial diversity, together with the deeper sequencing of specific locales to plumb the depths of a few communities. Coming up with a computational pipeline to deal with data and metadata on this scale -- storage, analysis, integration, presentation, and querying -- is truly challenging and should be fun.
For me, one recurring theme in the meeting was entanglement with management and standardization issues in big projects. Everyone was bitching about these on their existing projects, and I'm trying to avoid them. I have little or no interest in telling other people how to do their job; absolutely no interest in being told how to do my job; and a general lack of patience with standardization issues, however important and critical they may be. (In some ways, the Benevolent Dictator For Life approach to decision making used by the Python community seems optimal for medium-sized projects, although of course it takes a special person to pull it off. I acknowledge that decisions need to be made, but have no real interest in doing anything more myself than exploring the technical issues that could inform the decisions. Hmm, I wonder if the PEP process might be a good way to do a formal standards process in metagenomics?) I don't if there's any way for me to avoid this stuff in the future, but it's the main reason that I'm wary of joining big projects.
A few interesting references came up during the meeting. For example, I hadn't seen the Heilmeier Catechism before; I think it's a good idea for both scientific and open source projects to look at this before they do too much work! I was also entertained by Jonathan Eisen's report on things that the workshop participants disliked: for example, "Bureaucracy (and how it impedes science)."
All in all, I think this meeting was fantastic: very small, very focused, and with plenty of time for discussion and socializing. I'm happy the organizers invited me, so thanks to them!
--titus
posted at: 15:51 | path: /jul-10 | 2 comments