Wed, 04 Aug 2010

Illumina reads and their features


(This project is a collaboration with Jason Pell and Adina Howe)

A few weeks ago I posted about a k-mer filtering approach that we were using to remove low-abundance k-mers from metagenomic data sets, prior to assembly. This technique is working well, and we've managed to do some assembly of soil data sets. Our next challenge there is figuring out how to evaluate the quality of the assembly!

At the time of this previous post, I said

... any sequence arising as the result of a sequencing error is going to be extremely rare ...

and the mental model we've been using for filtering was that since Illumina reads are error-prone -- especially on the 3' side -- removal of sequences containing low abundance k-mers would help screen out sequences containing errors. The logic here is that if you break a sequence, like

atggAgtac

down into (say) 4-mers,

atgg
 tggA
  ggAg
   gAgt
    Agta
     gtac

and there was a single sequencing error (the 'A') in the original sequence, then all four k-mers containing the 'A' would be at low abundance in the data set. Set k=32 and you see the potential for lots and lots of low-abundance k-mers.

So I figured that the dominant effect of k-mer filtering would be removal of erroneous sequence.

It turns out I was dead wrong, and that we're actually just removing genuinely low-abundance sequences, presumably from genuinely low-abundance DNA present in the sample we're sequencing. But we took a circuitous route to get there. This is that story.

The k-mer abundance profile problem

We're interested in two things: first, eliminating reads with early errors in them; and second, trimming reads to remove the error-prone 3' end. For read trimming, we didn't know where to start trimming, so we figured that we could develop an empirical threshold by looking at the k-mer abundance profile across the read, and then choosing a point in the reads that initiated a run of a lot of low-abundance k-mers. This would be the place where the Illumina sequencing was breaking down.

So we plotted the summed abundance of 32-mers across the entire data set against their position in reads using our khmer magic, and got the green line on the following plot:

http://ivory.idyll.org/permanent/illumina-read-phenomenology/average-abundance-by-pos.png

OK, the green line doesn't show any drop in average abundance, so maybe we won't be able to trim using low-abundance k-mers as a signal for dropping quality. Minorly surprising but not stunning.

Unfortunately, Jason also plotted the average abundance of 17-mers -- red line, above plot.

It rises continuously throughout the read, suggesting that Illumina reads trend towards high-abundance 17-mers at their 3' end. WTF??

...this is not at all what you'd expect if errors were causing novel k-mers to appear at the end of reads.

Looking between the two graphs, we immediately suspected that homopolymer runs were popping up. That is, at the end of reads, erroneous runs of AAAAAA.... or CCCCC... etc. were read off by the Illumina sequencer or base calling system, resulting in a lot of low-complexity subsequences. For small k (=17), this would result in lots and lots of identical 17-mers; for larger k (=32), the homopolymer runs would be connected to more unique sequence earlier in the read and the 32-mers wouldn't all be unique.

To track this down, we first decided to look at the distribution of low (abundance = 1) and high (abundance >= 255) abundance k-mers across reads.

High- and low-abundance kmers, by position

When we looked at where unique k-mers tended to lie in the read, we found that there was a nearly uniform distribution of them (first graph, green line). Sure, there's a small uptick at the end, but if you look at the left axis, you can see that it's only about 10% -- not at all what you'd expect if the reads were very error-prone at the 3' end. The real signal is in the high-abundance k-mers, which have a huge predisposition to being at the 3' end of the read (second graph, green line). This suggested that our homopolymer mechanism for explaining the rise of the summed 17-mer abundances towards the 3' end of the reads was correct -- and, indeed, when we went in and looked at what exact sequences were present in high abundance, we saw a bunch of homopolymer runs.

http://ivory.idyll.org/permanent/illumina-read-phenomenology/abund=1.compare.png http://ivory.idyll.org/permanent/illumina-read-phenomenology/abund=255.compare.png

We verified that these high-abundance k-mers were predominantly homopolymer runs by using our Mark 1 Eyeball Detection System; here's the top of the list of abundance >= 255 k-mers:

ACAATGGATCAACAACGACAATGGATCAACAA
TGAGGCGGGGGTCACCCTGAGGCGGGGGTCAC
AATCTCATGCCTCAGCCAATCTCATGCCTCAG
CCCCCCCCCCCCCCCCCCCCCCCCTCCCCCCC
AAACCAAAAAAAAAAAAAAACAAAAAAAAAAA
GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG
GGGGGGGGGGGGGAGGGGGGGGGGGGGGGGAC
AAAAAAACAAAAAAAAACAAAAAACAAAAAAA
GGGGGGGGGCGGGGGGGGGGGGGGGGGGGGGG
CACCGTAGGCACCTTGGCACCGTAGGCACCTT
GGGGGGGGGGGGGGGGGGGGAGGGGGGGGGGG
AGAGTGTAGATCTCGGTGGTCGCCGTATCATT
GGGGGGGGGGGGGGGGGGGGGGGGGGGGTGGG
TGTAGGGAAAGAGTGTAGATCTCGGTGGTCGC
CTCCCCCCCCCCCCCCCCGCCCCCCCCCCCCC
GCGGGGGGGGGGGGGGGGCAGGGGGGGGGGGG
CCACCCCCCCCCCCCCCCCCCCCCCCCCCCCC
CCCCCCTCCCCCCCCCCCCCCCCCCCCCCCCC
CCCCCCCCCCCGCCCCCCCCCCCCCCCCCCCC
AAAAAAAAAAACCAAAAAAAAAAAAAAAACAA
AAAAAAACAAAAACAAAAAAAAAAAAAAAACA
GGGGGGGGGGGGGGGGGGGGGGGGGGGGCGGG
CGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGA
CCCCCCACCCCCCCCACCCCCCCACCCCCCCC
GGGGGGGGGGGGGGCGGGGGGCGGGGGGGGGC

So, in the raw reads from the sequencing facility, we see a lot of homopolymer runs on the 3' end, and not much in the way of unique k-mers (indicating that there's very little single-base error.)

Trimming reads by error scores

At this point, our work on k-mers converged with some work that Adina was doing. She was looking at trimming by quality score; the more recent base calling system used by Illumina puts in phred-score=2 markers at positions within the read where the sequence quality has deteriorated to the point where it shouldn't be trusted any more. We decided to see if trimming all sequences at such positions would lead to "better" (more uniform) k-mer distributions. We therefore looked at the low- and high-abundance k-mer distributions, as above, but for quality-trimmed sequence. We got the red lines in the plots above.

For the high-abundance k-mers, we see a near-perfect leveling of the k-mer abundance distribution in the quality-trimmed reads. This indicates that the Illumina phred-score=2 prediction is working nearly perfectly for predicting the start of poor sequence stretches, as indicated by homopolymer runs. This distribution is more or less what you'd expect of good-quality sequence.

The distribution of low abundance k-mers, however, is much more puzzling. Unique k-mers are dropping in number as reads get longer. This shouldn't happen unless the 5' end of the read is much more error-prone than the 3' end -- generally not what you'd expect! The explanation for this phenomenon may be that Illumina is over-trimming the reads: that is, in an effort to get rid of low-quality sequence, they're putting in a bias against good quality scores at the 3' end of their reads. This results in a systematic over-trimming of reads by our program, and my guess is that we're probably throwing away some good sequence here. (A casual conversation with an Illumina representative at a recent conference suggests that this is in fact what's going on.)

K-mer abundance histogram

The suspicious among you might note the decline in the red lines (trimmed data) on the right side of the low- and high-abundance k-mer graphs, and period-5 bounciness in the same red lines. Are we just getting rid of long reads in the trimming process? And whence the bounciness? Well, first, most of the reads remain untrimmed, so we can't account for declining signal; and the periodicity is actually caused by a periodicity in the length distribution of the trimmed reads:

http://ivory.idyll.org/permanent/illumina-read-phenomenology/length-dist-filtered.png

(This graph is actually by # of k-mers at each position, not by read length; add 31 to the bottom axis for technical correctness.)

It's interesting to look at the periodicity in the trimmed reads. It gives us some insight into the Illumina quality score calculation code: there must be some Illumina error calculation system that is dependent on blocks of five bases, although why they suggest trimming at positions that are exact multiples of five is an interesting question.

The homopolymer issue

So, it seems that Illumina reads are actually more prone to homopolymer runs than anything else, AND the per-base error rate is low enough to not result in lots and lots of unique k-mers.

This has some interesting implications for assemblers. In particular, you definitely want to trim 3' ends of Illumina reads before assembly, because otherwise assemblers might choke on the homopolymer runs. You certainly don't want to be paying attention to the homopolymer sequence!

I'd really like to take a look at k-mer abundances across 454 reads. Soon, perhaps.

Conclusions

My main conclusion from all of this is: in our data set, unique k-mers are not predominantly caused by errors. Rather, they are simply low-abundance reads from a complex population. There's therefore no scientific point to removing reads containing low-abundance k-mers, or trimming reads at the first unique k-mer. It's simply a way to remove data so as to favor high-abundance organisms, which might be more assemblable. (Which is fine, but it should be explicitly understood!)

This conclusion is almost certainly not true of lower-complexity samples, like mRNAseq, ChIP-seq, or genome resequencing.

A secondary conclusion is that k-mer abundance analysis is an excellent way to look for patterns within shotgun reads. Since patterns shouldn't generally be there, anything you see is suspicious and should be investigated further.

--titus

Appendix: the scripts

A pointer to some data, and all the scripts and code needed to generate the above graphs, is here:

http://github.com/ctb/khmer/blob/master/doc/blog-posts.txt

posted at: 10:43 | path: /jul-10 | 7 comments

Tags: ,


Sun, 25 Jul 2010

Terabase metagenomics -- the computational side


The Terabase Metagenomics meeting was good fun, but I most valued the computational component (because that's what I do). Rachel Mackelprang and Rob Knight and I wrote down a list of the computational issues involved in a petabase metagenomics project, and that list will help direct my future research. I'll try to write up the list sooner rather than later, but in the meantime, I wanted to write down some general thoughts on the subject of processing large amounts of data for metagenomics.

In any petabase metagenomics sequencing project focused on probing the unknown, we are going to have to face two basic computational issues: assembly and sequence comparison. Both of these issues require an all-by-all analysis, or at least the approximation of one. That is, as our sampling grows, we're going to want to assemble contiguous DNA sequences from larger and larger pools of short sequences; this requires some form of comparison that will scale as the amount of genomic novelty increases. We're also going to want to take the products of sequencing and assembly (raw reads and contigs) and compare them or derivative products to all known genes. By and large, this problem boils down to systematic comparisons of points in a sparsely but maybe thoroughly occupied n-dimensional space -- an N-squared problem that has no exact parallelizable solution.

Currently, neither assembly nor gene comparison scales well. Metagenomic assembly is currently the most disastrous: existing short-read assemblers use De Bruijn graphs and require ginormous amounts of memory (half-terabyte amounts or more) to run. Gene comparison is largely done using custom tools (for e.g. rRNA comparison) or the more general BLAST and HMMER/profile tools; these tools scale reasonably well for comparisons, but for post-processing into high-quality phylogenies and gene families, there is no scalable solution that I know of.

I firmly believe that the intractable-looking problem of assembly is going to be among the first to fall. This is because, at least in theory, metagenomic assembly should be massively parallelizable: metagenomes contain many distinct genomes that don't connect to each other, so if you can filter or bin the raw sequences into assemblable subsets, then you will be able to run the individual assemblies on as many computers as you have original contigs. This is a logical and simple approach that simply awaits appropriate heuristic techniques for division of the reads -- and we'll figure that out sooner rather than later. Plus, people much smarter than me are actively working on better ways of parallelizing the basic De Bruijn graph approach, and that will inevitably bear fruit.

The marker-gene comparison problem may already be solved. For any given gene family (like 16s) you can already do linear-time comparisons with profile methods like HMMER. Using existing breakdowns of phase space as a scaffold for new marker data as it arrives should be possible. At least, I can think of several ways that it will work, and Rob Knight is quite optimistic. (This is his area of expertise, and since he's a Python programmer, I'm inclined to give him the benefit of the doubt. ;)

The big problem facing in the future, I think, is the clustering and discovery of gene families. As we sequence deeper into the unknown, we're going to have to do ever larger, more systematic, and more sensitive comparisons of new data against all known genes, in all types of ways -- DNA and protein sequence, RNA secondary structure, and protein structure. The goal will be to elucidate gene families (which is an ill-defined concept), their phylogenetic structure, and as much as possible about their function, across all of the domains of life. This is hard because we are explicitly looking for trace connections -- distant homologies and similarities -- that we haven't seen yet. Unlike the marker-gene approach, we don't have good heuristics for reducing the problem to one of interpolation, and this means that the problem remains (at first glance) an O(N^2) problem. Even if you use a prefiltering approach (like BLAST or HMMER 3) you are still supra-linear, which is deadly when you're talking about billions or trillions of sequences.

Memory usage is particularly problematic for gene comparison, with most clustering techniques requiring scads of memory (N^2, in the extreme) for comparison problems. This may be something we can reduce with heuristic approaches, but biology has a way of messing with us so we can't rely on heuristics up front - especially in large discovery-based projects.

There's an additional complication, though: we really want to reduce the problem to one of online processing. As we gather petascale data, we don't want to do all-by-all comparisons of the new data against the old data; rather, we want to fit as much of the new data into an existing structure of classifications, and then sideline the remaining data for more complex exploration. This results in an O(N) classification for the "easy" or known stuff, while dealing with the hard stuff remains Bad (supra-linear, and perhaps exponential). For example, with respect to assembly, we expect to have assemblies of most of the more abundant soil genomes within the next 2-3 years; we should be able to pre-screen new data by seeing if it matches to an existing assembly, and if it does, simply count it (or discard it). Only the data that matches to no prior assembly would be kept. A similar approach can be used for 16s or marker-gene data. I can think of ways to approach it for protein-coding genes and non-coding RNA as well, but I'm very afraid of missing some or many of the unknown unknowns in our data sets by being too liberal with classification. On the flip side, being too conservative will inevitably lead to scaling problems, so there's no way to win.

As with all Big Data problems, the sheer scale of these issues causes all sorts of annoying data processing and storage problems. It's become a ruling principle of my tool development that algorithms must be constant memory: hence things like our k-mer filtering approach, which trades accuracy guarantees for pragmatic issues like the ability to run the software in the first place.

It's worth noting that Moore's Law will not save us. Neither memory nor CPU power is growing as fast as sequencing capacity -- some estimate the exponent on sequencing capacity as being > 5. (Thank goodness, disk space will probably keep up!) We're going to get a huge increase in the number of cores, but that doesn't really help with Big Data unless you have robust pleasantly parallel algorithms that don't require much memory -- which we don't, and in general probably won't, have.

So, my (our) outlook for the next few years will be: build tools that scale way better than our existing tools, in all directions (time, memory, parallelizability); yield guaranteed behavior with respect to memory consumption in particular; and are optimally sensitive and specific for the large class of unknowns (new proteins, ncRNA, and structures) that await us in the world of sequence.

Sounds like fun!

By the way, for those of you who like to think about problem equivalence, the protein comparison problem ultimately boils down to comparing protein structures. We, umm, can't solve that one, either...

--titus

posted at: 07:37 | path: /jul-10 | 0 comments

Tags: ,


Sat, 24 Jul 2010

Terabase metagenomics - meeting and outlook


I'm on my way back from the Terabase Metagenomics meeting in Snowbird, UT, and I'm buzzing with ideas about how to move forward in metagenomics and bioinformatics research. Metagenomics, the use of genomics approaches to study microbial communities, has been opening up as sequencing drops in price. With sequencing becoming ever more ridiculously deep and inexpensive -- we're looking at a power of 2-5 increase in depth every year for the next 5 years -- the questions have increasingly become intertwined with (surprise!) computational science problems and specifically bioinformatics. As we move towards multiple individual terabase (and collectively petabase) sequencing data sets in the next two years, life is going to become more complicated for metagenomicists and the bioinformaticians who aspire to help them. (That's me.)

The meeting was focused on the promise of terabase metagenomics for a "whole earth" microbial ecology sampling projects, and we spent quite a lot of time trying to figure out the size of unknown unknowns in metagenomics. There was a great mix of people at the meeting -- both "pure" microbial ecologists who regarded computers with suspicion and computer scientists who didn't understand how DNA became protein, as well as the entire range between those two poles. We talked about questions such as: "how deep do we need to sequence in order to get a real picture of microbial diversity?" (a: very) and "how do we reconcile genomic and transcriptomic data sets?" (a: unknown) and "how can we scale up 1000-fold when our existing algorithms are all N-squared?" (a: with difficulty.) The discussion was at a reasonably high level of sophistication: pretty much everyone in the room had "seen the elephant", and representatives of the major analysis pipelines were also present.

We didn't come to any firm conclusions about the biology, but we did sketch the outlines of a global sampling project to look at global microbial diversity, together with the deeper sequencing of specific locales to plumb the depths of a few communities. Coming up with a computational pipeline to deal with data and metadata on this scale -- storage, analysis, integration, presentation, and querying -- is truly challenging and should be fun.

For me, one recurring theme in the meeting was entanglement with management and standardization issues in big projects. Everyone was bitching about these on their existing projects, and I'm trying to avoid them. I have little or no interest in telling other people how to do their job; absolutely no interest in being told how to do my job; and a general lack of patience with standardization issues, however important and critical they may be. (In some ways, the Benevolent Dictator For Life approach to decision making used by the Python community seems optimal for medium-sized projects, although of course it takes a special person to pull it off. I acknowledge that decisions need to be made, but have no real interest in doing anything more myself than exploring the technical issues that could inform the decisions. Hmm, I wonder if the PEP process might be a good way to do a formal standards process in metagenomics?) I don't if there's any way for me to avoid this stuff in the future, but it's the main reason that I'm wary of joining big projects.

A few interesting references came up during the meeting. For example, I hadn't seen the Heilmeier Catechism before; I think it's a good idea for both scientific and open source projects to look at this before they do too much work! I was also entertained by Jonathan Eisen's report on things that the workshop participants disliked: for example, "Bureaucracy (and how it impedes science)."

All in all, I think this meeting was fantastic: very small, very focused, and with plenty of time for discussion and socializing. I'm happy the organizers invited me, so thanks to them!

--titus

posted at: 15:51 | path: /jul-10 | 2 comments

Tags: ,


Sun, 11 Jul 2010

Casual mentions


(Everyone needs an "I'm wonderful" wall, right? Here's mine!)

My rather snarky post on data management made it into Episode 34 of the Coast to Coast Bio Podcast.

Turns out I'm now a leading evangelist for cloud computing in genomic research, which I guess is true if evangelism is relative! My evangelism, plus 2 bucks, may get me a cup of coffee and a grant...

The long and detailed k-mer filtering post elicited a great comment from Nick Loman:

with blogposts like this, who needs peer-review journals?

Unfortunately my blog posts don't count for promotion & tenure, so that would be me needing peer-review journals ;). I also am sadly amused that one of the most content-ful blog posts I've written recently got very little notice. It's frustrating that people seem to prefer to kibitz about open access/open source more than actually blogging in detail about what they're doing at a technical level. We need more of that.

Greg Wilson just posted an interview with me about why Michigan State, GEDD, and I are all interested in Sotware Carpentry. I told him to stop bothering me and just read my several blog posts on the subject, but he out-persisted me. Grumble.

And finally, next week I'm going to start a blog-off with Jonathan Eisen as we'll both be at a Terabase Metagenomics meeting. (He doesn't know about the blog-off yet, so don't tell him, ok?) Points will be awarded on --

  • first blogpost on the worst -omics word comign from the meeting (we re-invented "predictomics" in my lab last Friday!)
  • snarkiest comment
  • quickest tweet off the mark ("just 10 seconds ago the speaker said this!")
  • least scientifically relevant tweet

I'm hoping to gin up a solid lead before Jonathan realizes we're competing. Shhh.

--titus

posted at: 07:44 | path: /jul-10 | 0 comments


Wed, 07 Jul 2010

A memory efficient way to remove low-abundance k-mers from large (metagenomic?) DNA data sets


I've spent the last few weeks working on a simple solution to a challenging problem in DNA sequence assembly, and I think we've got a nice simple theoretical solution with an actual implementation. I'd be interested in comments!

Introduction

Briefly, the algorithmic challenge is this:

We have a bunch of DNA sequences in (let's say) FASTA format,

>850:2:1:1145:4509/1
CCGAGTCGTTTCGGAGGGACCCCGCCATCATACTCGGGGAATTCATCTGAAAGCATGATCATAGAGTCACCGAGCA
>850:2:1:1145:4509/2
AGCCAAGAGCACCCCAGCTATTCCTCCCGGACTTCATAACGTAACGGCCTACCTCGCCATTAAGACTGCGGCGGAG
>850:2:1:1145:14575/1
GACGCAAAAGTAATCGTTTTTTGGGAACATGTTTTCATCCTGATCATGTTCCTGCCGATTCTGATCTCGCGACTGG
>850:2:1:1145:14575/2
TAACGGGCTGAATGTTCAGGACCACGTTTACTACCGTCGGGTTGCCATACTTCAACGCCAGCGTGAAAAAGAACGT
>850:2:1:1145:2219/1
GAAGACAGAGTGGTCGAAAGCTATCAGACGCGATGCCTAACGGCATTTTGTAGGGCGTCTGCGTCAGACGCCAACC
>850:2:1:1145:2219/2
GAAGGCGGGTAATGTCCGACAAACATTTGACGTCAAAGCCGGCTTTTTTGTAGTGGGTTTGACTCTTTCGCTTCCG
>850:2:1:1145:5660/1
GATGGCGTCGTCCGGGTGCCCTGGTGGAAGTTGCGGGGATGCGGATTCATCCGGGACGCGCAGACGCAGGCGGTGG

and we want to pre-filter these sequences so that only sequences containing high-abundance DNA words of length k ("k-mers"), remain. For example, given a set of DNA sequences, we might want to remove any sequence that does not contain a k-mer present at least 5 times in the entire data set.

This is very straightforward to do for small numbers of sequences, or for small k. Unfortunately, we are increasingly confronted by data sets containing 250 million sequences (or more), and we would like to be able to do this for large k (k > 20, at least).

You can break the problem down into two basic steps: first, counting k-mers; and second, filtering sequences based on those k-mer counts. It's not immediately obvious how you would parallelize this task: the counting should be very quick (basically, it's I/O bound) while the filtering is going to rely on wide-reaching lookups that aren't localized to any subset of k-mer space.

tl; dr? we've developed a way to do this for arbitrary k, in linear time and constant memory, efficiently utilizing as many computers as you have available. It's open source and works today, but, uhh, could use some documentation...

Digression: the bioinformatics motivation

(You can skip this if you're not interested in the biological motivation ;)

What we really want to do with this kind of data is assemble it, using a De Bruijn graph approach a la Velvet, ABySS, or SOAPdenovo. However, De Bruijn graphs all rely on building a graph of overlapping k-mers in memory, which means that their memory usage scales as a function of the number of unique k-mers. This is good in general, but Bad in certain circumstances -- in particular, whenever the data set you're trying to assemble has a lot of genomic novelty. (See this fantastic review and my assembly lecture for some background here.)

One particular circumstance in which De Bruijn graph-based assemblers flail is in metagenomics, the isolation and sequencing of genetic material from "the wild", e.g. soil or the human gut. This is largely because the diversity of bacteria present in soil is so huge (although the relatively high error rate of next-gen platforms doesn't help).

Prefiltering can help reduce this memory usage by removing erroneous sequences as well as not-so-useful sequences. First, any sequence arising as the result of a sequencing error is going to be extremely rare, and abundance filtering will remove those. Second, genuinely "rare" (shallowly-sequenced) sequences will generally not contribute much to the assembly, and so removing them is a good heuristic for reducing memory usage.

We are currently playing with dozens (probably hundreds, soon) of gigabytes of metagenomic data, and we really need to do this prefiltering in order to have a chance at getting a useful assembly out of it.

It's worth noting that this is in no way an original thought: in particular, the Beijing Genome Institute (BGI) did this kind of prefiltering in their landmark Human Microbiome paper (ref). We want to do it, too, and the BGI wasn't obliging enough to give us source code (booooooo hisssssssssssssss).

Existing approaches

Existing approaches are inadequate to our needs, as far as we can tell from a literature survey and private conversations. Everyone seems to rely on big-RAM machines, which are nice if you have them, but shouldn't be necessary.

There are two basic approaches.

First, you can build a large table in memory, and then map k-mers into it. This involves writing a simple hash function that converts DNA words into numbers. However, this approach scales poorly with k: for example, there are 4**k unique DNA sequences of length k (or roughly (4**k) / 2 + (4**(k/2))/2, considering reverse complements). So the table for k=17 needs 4**17 entries -- that's 17 gb at 1 byte per counter, which stretches the memory of cheaply available commodity hardware. And we want to use bigger k than 17 -- most assemblers will be more effective for longer k, because it's more specific. (We've been using k values between 30 and 70 for our assemblies, and we'd like to filter on the same k.)

Second, you can observe that k-mer space (for sufficiently large k) is likely to be quite sparsely occupied -- after all, there's only so many actual 30-mers present in a 100gb data set! So, you can do something clever like use a tree representation of k-mers and then attach counters to the end nodes of the tree (see, for example, tallymer. The problem here that you need to use pointers to connect nodes in the tree, which means that while the tree size is going to scale with the amount of novel k-mers -- ok! -- it's going to have a big constant in front of it -- bad!. In our experience this has been prohibitive for real metagenomic data filtering.

These seem to be the two dominant approaches, although if you don't need to actually count the k-mers but only assess presence or absence, you can use something like a Bloom filter -- for example, see Classification of DNA sequences using a Bloom filter, which uses Bloom filters to look for novel sequences (the exact opposite of what we want to do here!). References to other approaches welcome...

Note that you really, really, really want to avoid disk access by keeping everything in memory. These are ginormous data sets and we would like to be able to quickly explore different parameters and methods of sequence selection. So we want to come up with a good counting scheme for k-mers that involves relatively small amounts of memory and as little disk access as possible.

I think this is a really fun kind of problem, actually. The more clever you are, the more likely you are to come up with a completely inappropriate data structure, given the amount of data and the basic algorithmic requirements. It demands KISS! Read on...

An approximate approach to counting

So, we've come up with a solution that scales with the amount of genomic novelty, and efficiently uses your available memory. It can also make effective use of multiple computers (although not multiple processors). What is this magic approach?

Hash tables. Yep. Map k-mers into a fixed-size table (presumably one about as big as your available memory), and have the table entries be counters for k-mer abundance.

This is an obvious solution, except for one problem: collisions. The big problem with hash tables is that you're going to have collisions, wherein multiple k-mers are mapped into a single counting bin. Oh noes! The traditional way to deal with this is to keep track of each k-mer individually -- e.g. when there's a collision, use some sort of data structure (like a linked list) to track the actual k-mers that mapped to a particular bin. But now you're stuck with using gobs of memory to keep these structures around, because each collision requires a new pointer of some sort. It may be possible to get around this efficiently, but I'm not smart enough to figure out how.

Instead of becoming smarter, I reconfigured my brain to look at the problem differently. (Think Different?)

The big realization here is that collisions may not matter all that much. Consider the situation where you're filtering on a maximum abundance of 5 -- that is, you want at least one k-mer in each sequence to be present five or more times across the data set. Well, if you look at the hash bin for a specific k-mer and see the number 4, you immediately know that whether or not there are any collisions, that particular k-mer isn't present five or more times, and can be discarded! That is, the count for a particular hash bin is the sum of the (non-negative) individual counts for k-mers mapping to that bin, and hence that sum is an upper bound on any individual k-mer's abundance.

http://ivory.idyll.org/permanent/kmer-hashtable.png

The tradeoff is false positives: as k-mer space fills up, the hash table is going to be hit by more and more collisions. In turn, more of the k-mers are going to be falsely reported as being over the minimum abundance, and more of the sequences will be kept. You can deal with this partly by doing iterative filtering with different prime hash table sizes, but that will be of limited utility if you have a really saturated hash table.

Note that the counting and filtering is still O(N), while the false positives grow as a function of k-mer space occupancy -- which is to say that the more diversity you have in your data, the more trouble you're in. That's going to be a problem no matter the approach, however.

You can see a simple example of approximate and iterative filtering here, for k=15 (a k-mer space of approximately 1 billion) and hash table sizes ranging from 50m to 100m. The curves all approach the correct final number of reads (which we can calculate exactly, for this data set) but some take longer than others. (The code for this is here.)

http://ivory.idyll.org/permanent/kmer-filtering-iterative.png

Making use of multiple computers

While I was working out the details of the (really simple) approximate counting approach, I was bugged by my inability to make effective use of all the juicy computers to which I have access. I don't have many big computers, but I do have lots of medium sized computers with memory in the ~10-20 gb range. How can I use them?

This is not a pleasantly parallel problem, for at least two reasons. First, it's I/O bound -- reading the DNA sequences in takes more time than breaking them down into k-mers and counting them. And since it's really memory bound -- you want to use the biggest hash table possible to minimize collisions -- it doesn't seem like using multiple processors on a single machine will help. Second, the hash table is going to be too big to effectively share between computers: 10-20 gb of data per computer is too much to send over the network. So what do we do?

I was busy explaining to a colleague that this was impossible -- always a useful way for me to solve problems ;) -- when it hit me that you could use multiple chassis (RAM + CPU + disk) to decrease the false positive rate with only a small amount of communication overhead. Basically, my approach is to partition k-mer space into Z subsets (one for each chassis) and have each computer count only the k-mers that fall into their partition. Then, after the counting stage, each chassis records a max k-mer abundance per partition per sequence, and communicates that to a central node. This central node in turn calculates the max k-mer abundance across all partitions.

The partitioning trick is a more general form of the specific 'prefix' approach -- that is, separately count and get max abundances/sequence for all k-mers starting with 'A', then 'C', then 'G', and then 'T'. For each sequence you will then have four values (the max abundance/sequence for k-mers start with 'A', 'C', 'G', and 'T'), which can be cheaply stored on disk or in memory. Now you can do a single-pass integration and figure out what sequences to keep.

This approach effectively multiplies your working memory by a factor of Z, decreasing your false positive rate equivalently - no mean feat!

http://ivory.idyll.org/permanent/kmer-hashtable-par.png http://ivory.idyll.org/permanent/kmer-filter-process-2.png

The communication load is significant but not prohibitive: replicate a read-only sequence data set across Z computers, and then communicate max values (1 byte for each sequence) back -- 50-500 mb of data per filtering round. The result of each filtering round can be returned to each node as a readmask against the already-replicated initial sequence set, with one bit per sequence for ignore or process. You can even do it on a single computer, with a local hard drive, in multiple iterations.

You can see a simple in-memory implementation of this approach here, and some tests here. I've implemented this using readmask tables and min/max tables (serializable data structures) more generally, too; see "the actual code", below.

Similar approaches

By allowing for false positives, I've effectively turned the hash table into a probabilistic data structure. The closest analog I've seen is the Bloom filter which records presence/absence information using multiple hash functions for arbitrary k. The hash approach outlined above devolves into a maximally efficient Bloom filter for fixed k when only presence/absence information is recorded.

The actual code

Theory and practice are the same, in theory. In practice, of course, they differ. A whole host of minor interface and implementation design decisions needed to be made. The result can be seen at the 'khmer' project, here: http://github.com/ctb/khmer. It's slim on documentation, but there are some example scripts and a reasonable amount of tests. It requires nothing but Python 2.6 and a compiler; nose if you want to run the tests.

The implementation is in C++ with a Python wrapper, much like the paircomp and motility software packages.

It will filter 1m 70-mer sequences in about 45 seconds, and 80 million sequences in less than an hour, on a 3 GHz Xeon with 16 gbs of RAM.

Right now it's limited to k <= 32, because we encode each DNA k-mer in a 64-bit 'long long'.

You can see an exact filtering script here: http://github.com/ctb/khmer/blob/master/scripts/filter-exact.py . By varying the hash table size (second arg to new_hashtable) you can turn it into an inexact filtering script quite easily.

Drop me a note if you want help using the code, or a specific example. We're planning to write documentation, doctests, examples, robust command line scripts, etc. prior to publication, but for now we're primarily trying to use it.

Other uses

It has not escaped our notice that you can use this approach for a bunch of other k-mer based problems, such as repeat discovery and calculating abundance distributions... but right now we're focused on actually using it for filtering metagenomic data sets prior to assembly.

Acks

I talked a fair bit with Prof. Rich Enbody about this approach, and he did a wonderful job of double-checking my intuition. Jason Pell and Qingpeng Zhang are co-authors on this project; in particular, Jason helped write the C++ code, and Qingpeng has been working with k-mers in general and tallymer in specific on some other projects, and provided a lot of background help.

Conclusions

We've taken a previously hard problem and made it practically solvable: we can filter ~88m sequences in a few hours on a single-processor computer with only 16gb of RAM. This seems useful.

Our main challenge now is to come up with a better hashing function. Our current hash function is not uniform, even for a uniform distribution of k-mers, because of the way it handles reverse complements.

The approach scales reasonably nicely. Doubling the amount of data doubles the compute time. However, if you have double the novelty, you'll need to do double the partitions to keep the same false positive rate, in which case you quadruple the compute time. So it's O(N^2) for the worst case (unending novelty) but should be something better for real-world cases. That's what we'll be looking at over the next few months.

I haven't done enough background reading to figure out if our approach is particularly novel, although in the space of bioinformatics it seems to be reasonably so. That's less important than actually solving our problem, but it would be nice to punch the "publication" ticket if possible. We're thinking of writing it up and sending it to BMC Bioinformatics, although suggestions are welcome.

It would be particularly ironic if the first publication from my lab was this computer science-y, given that I have no degrees in CS and am in the CS department by kind of a fluke of the hiring process ;).

posted at: 08:09 | path: /jul-10 | 8 comments

Tags: , ,