Compressing next-gen sequencing data, and the next sequence squeeze competition

At BOSC 2012, we heard a report from Richard Holland on the Pistoia Alliance Sequence Squeeze competition. I'd run across this a couple of times before -- most notably in the Quip paper -- and was interested in hearing the results.

What was the problem being tackled? To quote,

The volume of next-generation sequencing data is a big problem. Data volumes are growing rapidly as sequencing technology improves. Individual runs are providing many more reads than before, and decreasing run times mean that more data can today be generated by a single machine in one day than a single machine could have produced in the whole of 2005.

Storing millions of reads and their quality scores uncompressed is impractical, yet current compression technologies are becoming inadequate. There is a need for a new and novel method of compressing sequence reads and their quality scores in a way that preserves 100% of the information whilst achieving much-improved linear (or, even better, non-linear) compression ratios.

In the end, though, I found the results disappointing. The main conclusions?

"First, that the gzip algorithm that served as the competition baseline is actually quite sufficient for run-of-the-mill compression, and second, that it's extraordinarily difficult to make huge improvements in all three of the judged dimensions," said Nick Lynch, external liaison of the Pistoia Alliance and chair of the judging panel.

In some sense, this is pretty reassuring -- if you're compressing data, it's nice that gzip is hard to beat! Even in situations like this, where you have domain-specific data, gzip can compete reasonably well. Better compression algorithms do exist, but there's a severe time/space tradeoff.

In another sense, though, it's insane. We should be able to compress the hell out of most short-read sequencing data! Short-read sequencing data is massively redundant for two reasons. First, because to achieve sufficient coverage with shotgun sequencing you need to sample deeply on average. For example, if you have a 100x coverage genomic sample, then more than half of your content is represented more than 100 times in your data set. Second, because most of the time what you're sequencing has already been sequenced before -- if you're looking for genomic variants (in resequencing data), you expect most of the data to be identical to what you already have. That is, if you are trying to figure out how my genome is different from, say, Craig Venter's, you can rely on the fact that, as two white guys, we're probably about 99.99% identical at the genomic level. Even if you're sequencing an Australian aboriginal you can guess that our genomes would be 99.9% identical (this number is made up -- corrections welcome. Is it 99.8 instead? ;) More generally, the vast majority of sequencing is biomedical (because that's where the money is) and with only a few exceptions, the vast majority of that data is going to be from a few well-sequenced species, and (from an information content perspective) essentially the same as what everyone else is sequencing. Even if you want to keep all the single-nucleotide variants and predicted insertion-deletion variants, we're talking about a few megabytes per data set -- not hundreds.

Some short-read data is more interesting - there are some very interesting applications of sequencing, mostly in environmental, ecological, and evolutionary studies, for which the theoretical redundancy of the data sets is lower than that for straight genomic or biomedical sequencing. For example, I would venture to say that our current soil metagenomic data sets are only about 80-90% compressible; this is compared to the common genomic and RNAseq data sets, which should be (info-theoretically) between 99-99.9% compressible. (Footnote: this is my mental calculation for the ratio between an appropriately 2-bits-per-base uncompressed storage, and the maximal information theoretic lossless compression. There's some fancier term for this that I'm forgetting...)

But the main point is this -- most short-read sequencing data is crazy redundant, WAY past the few 30x maximum compression ratio achieved by even the best of the sequence squeeze participants.

So, why didn't the sequence squeeze competitors do better?

The answer is simple: FASTQ. Specifically, quality scores.

A digression: how Illumina sequencing works, kinda

So, you have some DNA. And you want to sequence it.

One very common way is to use an Illumina machine, like so:

You fragment the DNA into molecules of length ~300-600 bases.
You stick the DNA randomly to a 2-D surface.
You amplify the DNA where it sits, so now you have lots of mostly identical fragments of DNA in close proximity to each other.
You strip the DNA down to single-stranded fragments.
Then you repeatedly add individually fluorescently colored A, C, G, and T bases to every single-stranded DNA molecule on the surface, thus reading off the next base of sequence on the opposite strand.
As you do #5, you image the 2-D surface with a microscope.
You take the resulting movie and simply "read" the DNA sequence present at each spot on the 2-D surface off from the twinkling little lights at each spot.
These DNA sequences are transcoded into strings of DNA, together with a quality score that correlates with how confident the reading algorithm in #6 is about the identify of each particular character of DNA. This is all put into a FASTQ file, which is one "output" of the whole process.

You can view this as an incredibly complex but ridiculously effective way to digitize DNA into bits, and it's how most sequencing is done these days. (There are many other sequencing schemes, and no doubt this will change. But as of October 2012, Illumina is by far the most used sequencing approach in terms of volumes of data.)

Back to compression

The problem that the Sequence Squeeze participants ran into is, basically, that the terms of the context specified lossless compression -- whatever process you used had to retain all the bytes. This is really troublesome because of one main thing -- the noise inherent in the data.

First, the FASTQ format specifies a range for quality scores of between 60-90 in length, and the information that these quality scores encode is noisy. Noise is hard to encode efficiently, and the more resolution you bring to bear on the noise, the less compression you can apply. While the ultimate compression you can do on the 60-character quality-score alphabet is going to be very dependent on the frequency distribution of those 60 characters, no matter how you cut it a 60-character alphabet is much less encodable than a 4-character alphabet like DNA, especially if there's randomness in the 60-character alphabet.

Second, the DNA sequence itself is noisy, too. The error rate of Illumina is in the 0.1% to 1% range, which automatically limits the compressibility of the data. This noise essentially limits the degree to which you can use prior knowledge (like reference genomes) to compress things, because the noise is random (i.e. uncorrelated with the reference genome). But these limitations are relatively minor compared to the noise in the FASTQ format.

There are (at least) two solutions to these problems. The first is to reduce the alphabet size of the quality scores. The second (more general) solution is to allow lossy compression.

The first is obvious, and is itself a kind of lossy compression, in the sense that along with discarding the noise in the quality scores, you're also going to discard some true information. But I think the second is much more interesting.

Lossy vs lossless compression

The Wikipedia page on lossy compression is excellent. The term essentially applies to discarding information in the service of compression, where "information" is defined information theoretically to mean pretty anything non-random. Lossless compression means you can recover exactly the set of bits you had to begin with; lossy compression means that you lose some of the data. The trick with lossy compression is to pick which of the data you discard in order to minimize the effect on downstream usage.

It turns out we all use lossy compression on a daily basis -- the image and video encoding used for JPEGs and MPEGs are lossy! Most if not all lossy compression schemes allow you to specify how much information you lose, and a cool graphic demonstration of that is available here, for JPEG images. You can compress photographs a surprisingly large amount before they become indistinct!

With images, the JPEG compression algorithm is designed to keep the image as recognizable as possible. Or, to put it another way, JPEG is attempting to discard as much data as possible while retaining as much mutual information as it can between the image and the downstream application -- in this case, your visual cortex and the image recognition capabilities of your brain. If you can throw away 95% of the data and still have your lolcatz, then why not? Again, the key point for lossy compression is downstream application -- what information are you trying to keep, vs which information are you trying to discard?

For sequencing applications, there's actually a pretty solid answer to "what do we want to do with the data?" Scientists generally have one of a few immediate downstream applications in mind -- sequencing things for assembly, resequencing things to find novel variants, and resequencing things to count them -- and it is relatively straightforward to envision what lossy compression would be in these cases. Assembly is an excellent example of lossy compression, in the sense that you are trying to maximize the mutual information between your sequencing data and the predicted genome output by the assembler. For resequencing, the pattern of variants from some reference is, basically, a compressed version of the sequencing data. And, for counting applications, you can summarize counts on a per-molecule or per-region basis.

Put another way, sequencing error is the information you don't care about. You really only care about the bases that might be real. With the high error rate of sequencing, the vast majority of "information" in your data set is actually erroneous with respect to the source biological data. And this puts the Pistoia Sequence Squeeze competition into sharp relief: by insisting that software perform lossless compression, they hamstrung themselves by effectively requiring that all the error be kept around.

What do I think we should be doing?

A pipeline view

If you view sequence analysis as a pipeline -- produce data, subject data to some analysis that leads to a prediction, and use the predictions in further analyses -- then you're probably either confused or concerned right now. Sequencing and bioinformatics is a rapidly evolving field, and new sequencing technologies and analysis packages are being developed up all the time; why would we want to compress our data into a result by saving the output of, say, the GATK toolkit? That would prevent us from taking advantage of improvements to GATK, or competitors to GATK! This thinking leads to the conclusion that I bet motivated the Sequence Squeeze competition: we need to save all the raw data, in case we come up with better ways to analyze it!

Fig 1. A somewhat abstract view of an analysis pipeline.

I'm sympathetic to the idea that we don't want to save all the output from any given analysis package. All too often we are feeding our data into a gigantic black box of an algorithm/software package/blob and we have very little idea of how it will work specifically on our data. Even when we do, we know that it's going to be making some choices with respect to sensitivity and specificity, and perhaps those will be the wrong choices.

However, given the contents of the FASTQ format, it is also pretty crazy to insist that all of that stuff is worth keeping. Complex proprietary algorithms are used to call bases and assign quality scores, and we can be reasonably certain that there are some serious heuristics (i.e. guessing) being applied; choosing to keep the exact number spit out by the sequencer is pretty arbitrary. There is little reason to believe that a 30-character quality score alphabet would be significantly worse than a 60-character quality score alphabet. Heck, in most cases, a variant call or a base call can be made with greater than 99.99% certainty; I bet you could quantize the quality score encoding down to "rock solid" and 7 shades of "less solid" without significantly affecting your base calls.

On the scale of things, I don't think we should make any one part of the pipeline sacred. We should be figuring out how to optimize the data we keep in order to meet specific limits on analysis sensitivity; and, in particular, we should be thinking about how to maximize the re-analyzability of the data in light of the specific sensitivity we need for each downstream application.

For example, if we are sequencing a single diploid individual, we can pretty confidently throw away minor variant information; if you have 100-1 allele calling at a given location your answer is pretty much guaranteed correct. If you're sequencing a population sample, e.g. a tumor, you might have more trouble picking a high threshold; heck, you might even want to keep all that data. But the choice should be made for each application.

The interesting thing is that such an "arbitrary" line in the sand has already been drawn! We don't keep the raw image files from the sequencing movies, which actually contain the maximum amount of digital information; we don't keep them because they're freakin' huge (multiple terabytes per run), and it doesn't make economic sense. It's easier and cheaper to resequence the 0.1% of samples that you need to reanalyze from scratch than it is to keep around the 99.9% of unnecessary data. It's just silly to argue that all the data is sacred, and choosing the FASTQ files as your "perfect" stage is not particularly meaningful.

So what should we keep? I bet we can compress the data down to a certain point, discarding things that are obvious errors, while still retaining all the signal and a certain amount of noise. This would still make it possible to do any downstream analysis you wanted -- just on a vastly decreased set of data.

Our digital normalization approach approximates this for assembly: it discards quite a bit of data and noise while keeping 99.999% of the information. However, diginorm discards the abundance information. Error correction, especially error correction that is tunable to retain polymorphism, would be a more sensible long term approach: it would enable lots of compression while still retaining abundance. What we're working on in my lab is technology to do this efficiently and tunably.

One of the key perspectives that I think our work (partitioning and digital normalization) brings to the table is the idea of prefiltering and eliminating known-bad data, rather than trying to guess at what is going to be good. We are trying to keep all of the information, in as generic a sense as possible, while discarding noise and scaling the problem; whether or not that information is actually used by any particular downstream analysis software is irrelevant. Another way to look at it is that we're trying to discard obvious noise and errors, rather than make specific calls about what is going to be interesting -- see Wikipedia on the LHC Computing Grid for an example of this in particle physics. This perspective is good because it gives us the ability to choose what kinds of tradeoffs we want without being unduly tied to any one downstream analysis platform.

Fig 2. Inserting lossy compression into an analysis pipeline.

It's important to note that this is not a new idea by any means -- I vaguely remember reading about cram a while back. This is a package that Ewan Birney et al. are working on that does reference based compression -- exactly what I'm talking about above wrt human DNA. (Thankfully Nick Loman pointed it out to me when I asked him to review a draft of this post and so I can refer to it properly ;). Cram, like Quip, attempts to make use of assembly techniques to generate a new reference for unmappable reads, while taking the mapped reads back to the reference for reference-based compression. Perhaps most interestingly to me, they are thinking about quantizing quality scores for exactly the reasons I point out above. You can read the full paper here.

What should the next competition be?

I'd love to see an open competition like the Pistoia Sequence Squeeze contest, but focused on lossy compression, as a prefilter to one or more specific downstream packages. That is, provide a way to calculate sensitivity and specificity from the output of (say) GATK, and also measure the time and space complexity of the entire pipeline from raw data through to end analysis. You could of course do this all on the cloud, to ensure a reasonably uniform and widely available computing environment.

On the one hand, this would encourage people to shave data down to the bare minimum such that GATK with the given parameters "got it right" -- not necessarily a good outcome. But, on the other hand, it would show what the true limits were, and it would probably drive a whole bunch of research into what the various tradeoffs were and how to design such algorithms, choose parameters, and evaluate their performance.

Triage, compression, and Big Data

As one BOSC attendee told me, with the size of the data we're starting to generate, we have to figure out how to compress the data, scale or parallelize the algorithms, or discard some of it. Some data will be gold, some will be copper, and some will be dross; it should at least possible to efficiently separate much of the dross from the might-be-gold.

My real interest is in enabling the appropriate economic decisions to be made. Storage and processing time are considerations that should to be discussed and explored for data-driven biology to become less handicapped, and so far we haven't done much of that exploration. I think this could end up having serious and dramatic consequences for clinical data processing down the road, in addition to obvious research benefits.

One caveat is that there are also legal issues surrounding data retention: for example, clinical folk are apparently mandated to keep all their "raw data" (which means FASTQ, not images, apparently). I'm not prepared to take on the social and legal aspects of this proposal :). But even so, you can dump your big data to very cold storage and retain the filtered data for daily or monthly or yearly reanalysis.

Conclusions

I don't have any, really; I'll post updates from our work as they occur. I'd love to hear about other people's attempts to do this (it's not a new idea! see cram, above) and also about why you think it's a horrible idea for NGS analysis...

Special thanks to Nick Loman, Jared Simpson, and Zamin Iqbal for sanity checking and helpful references!

--titus

p.s. After writing all of this, I discovered Ewan Birney's May post on the same topic in which he basically reprises everything I said above. So, um, YEAH. The question is, can our lab put all of that on a streaming basis? Answer: yes :).

p.p.s. Jared Simpson also pointed out this very interesting paper on using the Burrows-Wheeler Transform to compress large genomic databases: Cox et al., Bioinformatics, 2012.

Living in an Ivory Basement Stochastic thoughts on science, testing, and programming.