Fri, 14 Oct 2011
Metagenomics: we have all the answers (they're just all different)
I'm just on my way back from a JGI workshop on metagenome informatics, and I thought I'd take the opportunity to write up a short review.
The workshop was, frankly, excellent. We saw a bunch of talks on metagenome assembly (my current interest) as well as single-cell sequencing approaches, and a whole spate of data analysis platform talks. The basic perspective seems to be this: nobody has the solution to everything, and those who do have some answers seem to get different answers from others. The one possible exception to this was Jill Banfield, who gave a really inspiring talk about how they're using metagenomics on low- and medium-complexity communities to do some excellent science.
(Advertisement: my talk on scaling metagenome assembly is online here, at slideshare.)
The only somewhat dissatisfying note to the whole conference was the overall lack of inter-project cooperation and cross-validation. While some of the analysis platform folk are talking to each other, I don't think they have a good handle on how to really test each other's software, much less combine forces; too much of it all boils down to "trust us -- our approaches work", which is simply not the way to do science. Period.
Pain
"The thing that is driving all of us is pain." -- Victor Markowitz
This single statement probably sums up the workshop best! One of the most obvious tensions in the workshop was between sensitivity of results and scaling. For many microbial populations, it is critical to sequence deeply in order to see rare variants and species; and yet our assembly and annotation tools are, generally speaking, unprepared to handle this volume of data (dozens of Gb, if not Tb). This leads to pain, as we desperately attempt to make use of the data to address our biology.
There were also quite clearly two camps of people. One group had experience with metagenomic data sets of simple- to middling-complexity: think human microbiome, with up to a few hundred species of microbes per sample. The other group was confronting water and most especially soil, which may have hundreds of thousands if not millions of species per sample. The first group was prone to saying things like "it's not that tough a problem, folks! you just need to analyze more sequence! and then you can do anything you want!" to which the second group would then say "yeah, that's the problem, innit? all that sequence?"
I think in a year or so it will be easier to characterize this gap in complexity. We're finally getting results from soil assembly, and it's clear that we need terabases of sequence to get good results; but we don't yet have the ability to quickly & confidently analyze those terabases of sequence. Once we do, we can make quantitative statements illustrating the divide.
It was also nice to hear that everyone had settled on the best possible data processing pipeline at the meeting! (Although perhaps coincidentally, it was almost always their own software.)
Standards
The second day featured a lot of discussion of standards. Many people seemed to have a community standard that, if everyone would just start using theirs, would solve all the problems!!!
Of particular interest to me, Owen White gave a talk where he presented the Open Science Data Framework (OSDF). This is basically a central-server NoSQL implementation for storing metadata and data together. Eventually it will support data migration for locality, and lots of other nice features. The response to building Yet Another Big Database (But This Time, With No Schema!!) was muted, although I'm personally quite bullish on the idea that maybe, just maybe, we can have a place where the data and metadata are combined (!!!). It seems that the alternative to an OSDF-like database is to continue using flat files -- maybe even with URLs sometimes!! -- and I see that approach continuing to engender chaos. An alternative would be nice.
(I hate SQL for dataset storage, but maybe that's not a majority opinion.)
One particularly good quote from Victor Markowitz on the OSDF proposal, paraphrased: "This system will let us spread misunderstandings further and more quickly!" Hmm, is that a negative?
Also, someone said the HMP has 14 trillion reads, or 1.4 petabases of sequence. Whuh? Apparently this is what IMG/HMP and METAREP are being built to analyze. Yikes.
Big Black Data Analysis Boxes
Big black boxes for data analysis are a running theme in metagenomics, although I'm not entirely sure why -- something in the water? (We should sequence that.) Things like MG-RAST, IMG/M, CAMERA, and the various K-base projects offer to take your data (raw or not); fold, spindle, and mutilate it; and then return Results.
This approach gives me the heebiejeebies. I have lots of questions: What software are they running, how, with what parameters? What kind of internal QC do they have, and how can I run it myself on my own data? What version of what databases are used? What filtering do they do? The response (more or less) always seems to be "TRUST us! We know what we're doing!"
That line works better in a sitcom than it does in real life.
I really, really want to see a platform mentality evolve. Cue Steve Yegge's "platform" rant: http://news.ycombinator.com/item?id=3101876 or http://steverant.pen.io/
What's the goal, anyway? Reads vs. assembly
Several people were particularly grumpy about this whole "assembly" thing. I can only imagine this stems from having to go to too many meetings where a lot of really theoretical assembler discussion goes on, and is nonetheless treated like the central activity necessary for metagenomics. Since I work on assembly now, I enjoy those talks, but I fully understand that I am a masochist when it comes to science, and not everyone likes assembly.
Nonetheless I think people skeptical of assembly are largely wrong, at least in the overall concept. Assembly is a really important approach in metagenomics for several reasons, which Kostas Mavrommmmmmmmatis laid out the second day: scientifically speaking, assembly gives way better statistical signals for both gene finding and analysis of variation; you can also get linkage information from assembly, which is important for operon analysis. And, just as important -- from a pragmatic perspective, the reduction in data size from assembly saves both space and computation time, and assembly smooths out errors in a way that not much else does.
That's not to say assembly is easy, or a panacea, but I think it's an incredibly valuable approach for all those reasons.
As for single read analysis, I am deeply skeptical of any conclusions reached from analyzing reads of length 120 or less -- that's not much signal, bub. (Note, I'm also guilty of this; check out my publication with Victoria Orphan in 2006. What can I say? I was younger and naiver.) Friends don't let friends analyze Illumina short reads. And yet, it seems to be a prevalent, perhaps even dominant, activity in metagenomics. Boo.
So: assemble, man. It's needed.
But as for goals... well, Brooklin Gore said it well when he said, "You come to understand that the purpose of a machine is ultimately not to run programs, but to solve problems." What problems do we want to solve?
Good question.
There wasn't much unity in the goals of the attendees. Some people wanted whole genomes. Others wanted genes. Yet others wanted variations in those genes. Presumably most people wanted metabolic profiles of one sort or another, except for those who didn't. I think this reflects something that Guy Cochrane pointed out: sequencing is fast becoming a common feature of much of biology, and it is extremely hard to address everyone's needs in one platform. So, that makes it even more unlikely that one single approach or one single analysis platform will answer even a majority of users' needs. There's a lot of room out there, folks.
On benchmarking
Everyone got up and gave a different set of assembly results, and it became kind of obvious that everyone optimizes their assembler for a specific set of statistics and then reports only those. Hey -- what about a standard set of benchmarks??
Well, there was a strange reluctance on the part of some Senior Scientists to invest in assembly benchmarking. As near as I could make out, this was because of a fear that assembler authors would then dive into optimizing for those benchmarks at the expense of Platonic truth. Fair 'nuff. But I think there's gotta be a happy medium between No Benchmarks and Only Benchmarks. Right now I find it extremely difficult to figure out which assemblers are better for what purposes, and I'm pretty sure everyone else is just as confused (modulo the authors of assemblers themselves, who are generally quite positive that their assembler is the best).
Anyhoo, I'm thinking quite hard about setting up a couple of single cell and metagenomic data sets, along with assembly and evaluation pipelines, on Amazon Web Services. That should at least make it easy to run people's assemblers on the same data sets & compare the results.
--
Hmm, conclusion time...
The metagenomic and single cell data is already coming, thanks to our sequencing center overlords.
Generally speaking, we don't know how to handle it, either by assembly or by single read analysis. We certainly don't know how to scale all the analyses that everybody wants to do.
The data will still be coming, regardless.
Good times!
--titus
posted at: 20:25 | path: /oct-11 | 0 comments
Tue, 09 Aug 2011
Assembling genomes with modern sequencing
As sequencing gets cheaper and cheaper, one would expect the answer for how to best sequence (and assemble!) any given genome would change. Most biologists assume something along these lines: everyone else has achieved some standard coverage (say 10x, or 100x) for their genome, so all we need to do is multiply that number by the size of my genome of interest, and then multiply that by the cost/bp, and voila! I will be able to have my very own genome sequence!
Naturally it's a bit more complicated than that, largely because the length of the reads matters quite a bit. If you're reading off a 1 Gbp eukaryotic genome in chunks of 100 bases, you're going to have trouble assembling the darn thing, for three reasons. First, you have to worry about complex repeats, which (in the context of assembly) are just plain evil, because they create connectivity structures that simply can't be resolved without additional information. Second, you need to think about sequencing bias in GC- and AT-rich regions -- most sequencers don't do that well on GC-rich regions, which are plentiful in big eukaryotic genomes. And third, normal sampling variation in shotgun coverage will screw you, on top of all of this, if you don't think about it.
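Here's a back-of-the-envelope sketch of the naive "multiply it all together" estimate, plus the sampling-variation caveat. It assumes idealized, uniformly random shotgun sampling (a simple Poisson model), which real sequencers don't give you, and the per-base price is a made-up placeholder rather than anyone's actual quote. The function names are mine.

```python
import math

def bases_needed(genome_size_bp, coverage):
    """Total bases to sequence for a target average coverage."""
    return genome_size_bp * coverage

def naive_cost(genome_size_bp, coverage, cost_per_bp):
    """The 'coverage x genome size x cost/bp' estimate from the text."""
    return bases_needed(genome_size_bp, coverage) * cost_per_bp

def fraction_uncovered(coverage):
    """Under a Poisson model, P(a base is hit by zero reads) = exp(-coverage)."""
    return math.exp(-coverage)

genome = 1_000_000_000            # a 1 Gbp genome, as in the example above
hypothetical_cost_per_bp = 1e-7   # placeholder price, purely illustrative

for c in (1, 5, 10, 30):
    missed = fraction_uncovered(c) * genome
    print(f"{c:>3}x: {bases_needed(genome, c):.1e} bases, "
          f"${naive_cost(genome, c, hypothetical_cost_per_bp):,.0f}, "
          f"~{missed:,.0f} bases expected to get no reads at all")
```

Even in this best case, average coverage is not the same thing as every base being covered; bias and repeats only make the gap worse.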
So, what is the optimal sequencing strategy, then?
There's been some interesting discussion on the assemblathon mailing list about all of this, which, for the most part, I'll be paraphrasing and interpreting: the list archives are closed and the list policy about citing people is that I need to ask them for individual permission, and that's too much work :). If you're interested in the source messages, I recommend subscribing yourself and looking through the archives for messages from June 2011; if they open up the archives, I'll link directly to some of the more interesting messages.
A key component of any sequencing strategy discussion nowadays is that sequencing has become very commercial. While this drives down costs (pretty dramatically!), you also can't trust a damn thing that sequencing companies say, because the market is very competitive and there's very little percentage in straight-up honesty, much less full disclosure. (Paranoid much? Yeah, buy me beer sometime.) Moreover, there are several competing sequencing centers -- primarily the Broad Institute and the Beijing Genomics Institute, as well as the Joint Genome Institute, Sanger, and St. Louis, and probably another five that I'm missing -- that all appear to have adopted different policies with respect to sequencing genomes. I don't really know what they are in detail, but (for example) Broad has a stereotyped sequencing strategy for which it has written its own software suite (see ALLPATHS-LG), and you can read the details in the PNAS paper. The bottom line is you need to talk to people who have experience with actual sequence, and not be overly trusting of either sequencing centers or company reps.
Another key component of any sequencing strategy discussion is the software being used to assemble. Some centers have their own assemblers (BGI has SOAPdenovo, Broad has ALLPATHS-LG), but there are literally dozens of assemblers out there. The assemblers can broadly be broken down into about four different types: overlap-layout-consensus, de Bruijn graph, greedy local, and "other". I'm most familiar with de Bruijn graph assemblers, since that's what I'm working with here at MSU, but there are advantages and disadvantages to the various kinds. Maybe more on that later. But the bottom line here is that there are many brilliant, passionate, opinionated people who have written their own assembler, and will swear by all that is holy that it is the best one. How do you choose?
A third key component of any sequencing strategy discussion is the genome itself. Mihai Pop's group just published a veddy interesting article on prokaryotic assembly (see Wetzel et al., 2011) in which they argue that the optimal sequencing strategy needs to be dynamically adjusted to the repeat structure of the genome: that is, you need to do a first sequencing run; analyze it for repeat structures; and then plan out your next rounds of sequencing based on that information. While I am always suspicious of plans that require intelligent thought (slow! expen$ive!) to be inserted into sequencing pipelines (fast! high throughput!), I think they make a pretty good argument -- and that's just for prokaryotic genomes, which are simple compared to eukaryotic genomes... for eukaryotic genomes, you also have to worry about heterozygosity (how much internal variation there is between the two haploid genomes you're sequencing). So how can you strategize to deal with your genome?
But let's back up. What are we doing, again?
Sequencing genomes is like this:
Long, not-terribly-random strings of (physical) DNA, O(10^7-10^10) in length.
Goal: determine full sequence and connectivity of strings of DNA.
Process: fragment into lots of bits, sequence in from both ends of each bit. Use overlaps, size of bits ("insert size"), to computationally reassemble.
(You can read an earlier blog post about why this is a hard problem here, or go read the UMD CBCB assembly primer here.)
The challenge, succinctly put, is this: in the face of uneven coverage and repetitive subsequences, devise the optimal coverage and range of insert sizes so that you can (a) sample most of the genome sufficiently and (b) resolve most repetitive regions by looking at pairs of ends. Do so (c) as cheaply as possible.
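The fragment-and-read-both-ends process can be sketched as a toy simulation. Everything here is an assumption for illustration: the "genome" is a short random string (real genomes are, as noted, not terribly random), there are no sequencing errors, and "insert size" is taken to mean the full fragment length including both reads. The function names are mine.

```python
import random

COMPLEMENT = str.maketrans("acgt", "tgca")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def paired_end_reads(genome, n_pairs, insert_size, read_len, rng):
    """Pick random fragments of length insert_size; read in from both ends."""
    pairs = []
    for _ in range(n_pairs):
        start = rng.randrange(len(genome) - insert_size + 1)
        fragment = genome[start:start + insert_size]
        left = fragment[:read_len]
        # The second read comes off the opposite strand, so it is the
        # reverse complement of the far end of the fragment.
        right = reverse_complement(fragment[-read_len:])
        pairs.append((left, right))
    return pairs

rng = random.Random(42)
genome = "".join(rng.choice("acgt") for _ in range(2000))  # toy random genome
pairs = paired_end_reads(genome, n_pairs=5, insert_size=600, read_len=100, rng=rng)

for left, right in pairs:
    # Locate both ends on the genome; they should sit one insert apart.
    a = genome.find(left)
    b = genome.find(reverse_complement(right)) + len(right)
    print(f"left read at {a}, right read ends at {b}, span {b - a}")
```

The known span between the two reads of a pair is the "additional information" that lets an assembler bridge a repeat shorter than the insert.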
OK, so what are the parameters you can twiddle?
It really boils down to these choices:
Sequencing technology: 454 or Illumina are the main production machines these days, although I hear things about PacBio, Ion Torrent, and ABI SOLiD. 454 is much more expensive per base, but gives longer reads (500bp +); Illumina is (much) cheaper per base, but the reads are annoyingly short (100-150 bp). With Illumina you can get ~600 bp inserts easily, larger inserts (3kb, 5kb, 10kb) with more difficulty. Not sure about 454.
Coverage: how much money do you want to spend, on what sequencing technology?
Insert sizes: larger inserts are really useful for bridging repeats, but also much more expensive.
And... I think that's about it. Or is it?
Well, you need to ask two more questions: can your assembler of choice take advantage of mixed read lengths, with mixed error models from different technologies, and/or various insert sizes? And can your sequencing center actually make all the different technologies work reliably?
(As I keep telling my students, if it were easy they wouldn't need brilliant people like us to work on it, now would they?)
When I get swamped with these kinds of questions, I usually try to retreat back into my reductionist hidey hole to clear my head. So let's back up again. What are the fundamental issues?
We can't do much about sequencing bias or heterozygosity, except to say that more coverage is generally going to make both biases and internal sequence variation stand out more reliably from random error. If we actually want to assemble our genome, we also can't do much about improving current assemblers, and it's unclear how to evaluate assemblers anyway, and most of them don't appear to do a great job on very heterogeneous sequence types (i.e. from multiple types of sequencers) -- anyway, these are the questions the assemblathon is asking, and they're doing a good job; just read the paper when it comes out. And we don't have much control over whether or not our sequencing center screws up.
So we're left with trying to decide on how much 454, how much Illumina, and what insert sizes. (Can you hear the shrieks of pain from sequencing and assembly aficionados as I ruthlessly strip all of the subtleties from the argument? Fun!)
For insert size, I like to point people to these two references:
Whiteford et al., Nucleic Acids Res, 2005 http://nar.oxfordjournals.org/content/33/19/e171.full
Butler et al., Genome Res, 2008 http://genome.cshlp.org/content/18/5/810.full
which make the nice point that there are many repeat structures that you simply cannot resolve with single-ended reads -- you need paired-end reads to do a good job of assembly. These two papers have recently been joined by a third, the Wetzel et al. paper above, which suggests that there are particular (and surprisingly frequent) repeat structures that cannot be resolved except by a very specific insert size. But barring advance knowledge of repeat structure, I would argue that a nice range of inserts, from 3k to 5k to 10k, should give you decent results. We have that for a parasitic nematode project in which I'm involved, and it's given us decent scaffold sizes.
With 454 vs Illumina, I am skeptical that 454 is a good expenditure of money at this point. The number of bases is so astonishingly low compared to what Illumina is outputting (~1m vs ~1bn for the same amount of money, I think? At any rate, at least 100x) that you really need to justify any 454 expenditure. That having been said, I may be so used to working with crappy genome assemblies (buy me beer, hear me weep) that I'm ignoring how much better they would be with ~10x 454 coverage. Certainly Greg Dick's group at U of M has shown me pretty good evidence that 454 sequences things that Illumina won't touch, in metagenomic data. So I can't give you much more than my experience that Illumina will get you ~80% of the way to a decent genome assembly -- which is something many people would love to have.
Is there an elephant in the room, and, if so, what is it? Well, this touches heavily on our lab's research, but I think that sequencing biases are screwing up the assembly game far more than people think. Right now assemblers have a bunch of poorly understood heuristics that address sequencer-specific bias, and our experience with these in metagenomic sequencing suggests that these artifacts and heuristics are a significant source of misassembly. More on that ... later.
I'm really at a loss about how to conclude any discussion of sequencing strategy. It's ridiculously complicated, comes down to a lot of guessing about what problems you're likely to run into, and involves an extremely rapidly changing technology suite. Getting a comprehensive answer out of anyone is hard... and won't get any easier for a while.
That having been said, I'd appreciate pointers to blog posts and open discussions of these issues on mailing lists. Having taught (or tried to teach) some biologists in this area recently, as part of my NGS course, I think actually providing these discussions could be incredibly valuable and could raise the level of discourse a fair bit.
--titus
posted at: 17:10 | path: /aug-11 | 4 comments
Sun, 29 Aug 2010
Assembly is hard because it's not decomposable
(with Adina Howe, Jason Pell, Rosangela Canino-Koning, and Arend Hintze).
Introduction
A few weeks ago I blogged a bit about a k-mer filtering system, khmer, that we were using to reduce metagenomic data to a more tractable size by throwing out error-prone reads (see A memory efficient way to remove low-abundance k-mers from large DNA data sets). No sooner had we tried that than we realized that we were probably primarily throwing away good, if low-abundance, data (see Illumina reads and their features). No matter: we couldn't assemble the original data sets anyway, so we had to get rid of some of it, right?
The subject of this blog post is not how to best throw away data. (I'll address that in a few weeks.) Instead, it's why we have to throw away data in the first place. More precisely,
Why is assembly hard?
First, some background. Imagine you have some long-ish strings (1 million to 200 million letters in length), composed of only the letters A, C, G, and T, and you want to know what the sequence of the strings is. You can't actually read the sequences directly; they're too physically small. But you can randomly retrieve short subsequences ~100-1000 letters in length from the original long sequences. You don't know where they're from on the original sequence, or even which of the original sequences they're from. And the process of retrieval is error-prone, so you can't even trust the exact sequence you get. But you do know that, by and large, the short sequences are mostly correct; and (the most important bit) that you can get as many of these short sequences as you want, within $$ limitations.
From this kind of information you want to reconstruct the original sequences.
This is a basic description of the process of shotgun sequencing, in which you take DNA, shred it, and then sequence from it randomly -- many, many times. And it lays out the basic problem of assembly, too: you want to figure out how to reconstruct the original sequences from the little subsequences that you actually have.
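As a toy version of that setup, here is a minimal simulation of error-prone shotgun reads. The source sequences are random strings, the error model is plain uniform substitution at a fixed rate, and all names and numbers are illustrative assumptions rather than a model of any real sequencer.

```python
import random

def shotgun_reads(sequences, n_reads, read_len, error_rate, rng):
    """Sample reads at random positions from randomly chosen source sequences,
    substituting a wrong base with probability error_rate at each position."""
    reads = []
    for _ in range(n_reads):
        source = rng.choice(sequences)                      # we aren't told which one
        start = rng.randrange(len(source) - read_len + 1)   # ...or where
        read = []
        for base in source[start:start + read_len]:
            if rng.random() < error_rate:
                base = rng.choice([b for b in "ACGT" if b != base])
            read.append(base)
        reads.append("".join(read))
    return reads

rng = random.Random(0)
originals = ["".join(rng.choice("ACGT") for _ in range(5000)) for _ in range(2)]
reads = shotgun_reads(originals, n_reads=1000, read_len=100, error_rate=0.01, rng=rng)
print(len(reads), "reads; first one starts", reads[0][:30] + "...")
```

The function returns only the reads, with no record of source or position, which is exactly the information the assembler has to recover.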
If you are a computer scientist, you can probably already think of some basic ways to proceed. For example, you could do an all-by-all comparison of the short sequences, lay out which ones overlap and how, build a map of the overlaps, and try to build a tiling path that maximizes the connectivity of your map. Voila! Some approximation of the original sequences results! This approach is known as the overlap-layout-consensus approach, where at the end you produce a consensus view of the original sequence based on all the reads you have.
If you are a computer scientist or someone who programs for a living, you will also immediately recognize this as a rilly rilly hard problem! Forget biological peccadilloes; just doing this efficiently for large collections of sequences is computationally quite difficult. In particular, the all-by-all comparison is brutal: the number of comparisons scales as N**2 with the number of sequences N, so even if it's relatively efficient to compare two sequences, the problem behaves poorly as your data set grows. Plus, building a map of the overlaps is another hard problem: holding all that information in memory requires (yep!) O(N**2) memory, which is not cheap.
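To put numbers on the N**2 scaling, here is a tiny calculation of how many pairwise comparisons an all-by-all approach implies. It counts unordered pairs only, and the read counts are arbitrary round numbers.

```python
def pairwise_comparisons(n_reads):
    """Number of distinct read pairs in an all-by-all comparison."""
    return n_reads * (n_reads - 1) // 2

for n in (1_000, 1_000_000, 1_000_000_000):
    print(f"{n:>13,} reads -> {pairwise_comparisons(n):.2e} comparisons")
```

Every 10x increase in reads costs roughly 100x more comparisons, so a faster pairwise comparison only buys a constant factor.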
Is there any easy way to break down the problem? After all, big computers aren't cheap, but small computers are; so if you could split the problem into many smaller chunks, you could imagine using a grid or Beowulf approach, and just buying lots and lots of cheap hardware to scale.
Alas, the problem isn't easy to subdivide. It's easy to see why, if you think about the nature of the original sequences. Here's a little diagram; suppose, for example, that you have four subsequences all derived from one original sequence:
(orig) atggaccagatgagagcatgagccatggacggatcatggaaaacggttaaaaggggcatgg
(1)    atggaccagatgagagca
(2)                 gagcatgagccatggacggatc
(3)                                  ggatcatggaaaacggttaaaa
(4)                                                  ttaaaaggggcatgg
If the layout above is the only way that subsequences 1-4 overlap and can assemble, then to decompose the overlap problem across multiple computers would involve sending (1) and (2) to one computer, and (3) and (4) to another, assembling them there, and then taking the results and composing them on a shared node. Unfortunately, to do this efficiently currently requires that you know that 1 and 2 overlap, and that 3 and 4 overlap -- which is basically the problem that you already need to solve!
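For the four example reads, a naive suffix/prefix overlap search plus greedy merging does recover the original. This is a toy sketch and not a real overlap-layout-consensus assembler: it assumes error-free reads, a single strand, and a minimum overlap of 5 that I picked by hand; the function names are mine. Note that every round of merging is itself an all-by-all comparison.

```python
from itertools import permutations

ORIG = "atggaccagatgagagcatgagccatggacggatcatggaaaacggttaaaaggggcatgg"
READS = [
    "atggaccagatgagagca",       # (1)
    "gagcatgagccatggacggatc",   # (2)
    "ggatcatggaaaacggttaaaa",   # (3)
    "ttaaaaggggcatgg",          # (4)
]

def overlap(a, b, min_len=5):
    """Longest suffix of a that equals a prefix of b, or 0 if shorter than min_len."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0

def greedy_assemble(reads, min_len=5):
    """Repeatedly merge the pair with the largest overlap until none remain."""
    contigs = list(reads)
    while len(contigs) > 1:
        n, a, b = max(
            ((overlap(a, b, min_len), a, b) for a, b in permutations(contigs, 2)),
            key=lambda t: t[0],
        )
        if n == 0:
            break                       # nothing left that overlaps
        contigs.remove(a)
        contigs.remove(b)
        contigs.append(a + b[n:])       # glue b onto a, dropping the shared part
    return contigs

print(greedy_assemble(READS) == [ORIG])
```

Drop the minimum overlap to 4 and read (4) also appears to overlap read (1), via the repeated "atgg". That is a small taste of why you can't decide which reads belong together without looking at all of them.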
As I understand it -- I'm not a computer scientist unless you look at my letterhead -- there is simply no efficient way to decompose the overlap-layout-consensus assembly algorithm without assuming something about the structure of the data and/or introducing errors. (If you disagree, I'd appreciate either a reference or an implementation; thanks ;)
The second, or possibly third, generation of assemblers
OK, but computer scientists and computational biologists aren't dumb, and they like to tackle hard problems, and frankly this is an incredibly important problem to solve (for all sorts of reasons that you'll have to trust me on for now). Moreover, N**2 scaling is simply unacceptable!
Newer assemblers use a de Bruijn graph approach. Essentially, this involves breaking the subsequences down into fixed-length words of length k, and constructing an overlap graph. For example, taking the sequences above:
(orig) atggaccagatgagagcatgagccatggacggatcatggaaaacggttaaaaggggcatgg
(1)    atggaccagatgagagca
(2)                 gagcatgagccatggacggatc
(3)                                  ggatcatggaaaacggttaaaa
(4)                                                  ttaaaaggggcatgg
you would break the original sequences down into words of length (say) 5, yielding:
atgga gatga catga atgga atcat aaacg aaagg
tggac atgag atgag tggac tcatg aacgg aaggg
ggacc tgaga tgagc ggacg catgg acggt agggg
gacca gagag gagcc gacgg atgga cggtt ggggc
accag agagc agcca acgga tggaa ggtta gggca
ccaga gagca gccat cggat ggaaa gttaa ggcat
cagat agcat ccatg ggatc gaaaa ttaaa gcatg
agatg gcatg catgg gatca aaaac taaaa catgg
aaaag
The overlaps between k-mers now implicitly give you a graph connecting each k-mer to all overlapping k-mers; and if you can find a path that traverses every node in this graph once, you will have your original contig.
Note that this actually works, although of course k must be much bigger than 5 in practice, and there are all sorts of cute tricks you must play to do a good job of disentangling complicated graphs.
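Here is a toy sketch of the k-mer graph idea on the same example. It is emphatically not how Velvet and friends are implemented: it ignores errors, reverse complements, and coverage, and it only walks graphs that are a single unbranched path. The function names and the set of "tiled" 20-base reads are my own inventions for illustration.

```python
ORIG = "atggaccagatgagagcatgagccatggacggatcatggaaaacggttaaaaggggcatgg"
READS = ["atggaccagatgagagca", "gagcatgagccatggacggatc",
         "ggatcatggaaaacggttaaaa", "ttaaaaggggcatgg"]

def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def build_graph(reads, k):
    """Map each distinct k-mer to the k-mers that overlap it by k-1 bases."""
    nodes = {km for read in reads for km in kmers(read, k)}
    return {n: [n[1:] + b for b in "acgt" if n[1:] + b in nodes] for n in nodes}

def walk(graph):
    """Follow unique successors from the unique start node.
    Returns the spelled-out contig, or None if the graph isn't one simple path."""
    with_pred = {m for outs in graph.values() for m in outs}
    starts = [n for n in graph if n not in with_pred]
    if len(starts) != 1:
        return None
    node = starts[0]
    contig, seen = node, {node}
    while len(graph[node]) == 1 and graph[node][0] not in seen:
        node = graph[node][0]
        seen.add(node)
        contig += node[-1]
    return contig if len(seen) == len(graph) else None

g5 = build_graph(READS, 5)
print(len(kmers(ORIG, 5)), "k-mer positions,", len(g5), "distinct 5-mers")
print("k=5 walk:", walk(g5))            # repeated 5-mers make the path ambiguous

# Hypothetical longer, heavily overlapping reads, so that a bigger k still connects.
tiled = [ORIG[i:i + 20] for i in range(len(ORIG) - 19)]
print("k=8 walk recovers original:", walk(build_graph(tiled, 8)) == ORIG)
```

With k=5, repeated words like atgga and catgg collapse into single nodes, so the simple walk gives up; that's the toy version of why k must be much bigger in practice and why assemblers need tricks for tangled graphs. With k=8 every word in this sequence is unique and the walk spells out the original.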
Why is this an advantage over the overlap/layout/consensus approach that we looked at first? I'm not sure I've identified all the reasons, but there are at least two very important ones.
First, memory usage. While your memory usage for finding overlaps grows > O(N) with the overlap approach (with sparse matrices it should be N log N, I think?), the de Bruijn graph approach consumes only as much memory as you need to represent each new k-mer (so, with the number of novel k-mers) as well as the connections between them (which can be implicitly represented if you have efficient k-mer lookup). For large, deeply sequenced data sets this is going to be a huge savings: there are only three billion bases in the human genome, and probably only two billion unique k-mers of length 32 -- so if you can store k-mers efficiently (hint: you can) then the de Bruijn graph approach is really great.
Second, k-mers and k-mer overlaps can be stored and queried efficiently -- you just use a hash table or a trie structure. For example, you can store all 4**17 k-mers of length 17 as 34-bit offsets in a hash table (2 bits per DNA base), or you can use a branching trie structure to store arbitrarily long k-mers (see tallymer). Hash tables are efficient (if big) representations for densely occupied k-mer spaces, while tries will be efficient for sparsely occupied k-mer spaces. Arbitrary length sequences are comparatively difficult to store and query.
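The 2-bits-per-base trick is easy to sketch. The A=0, C=1, G=2, T=3 mapping below is one arbitrary choice (any fixed assignment works), and the function names are mine, not from any particular library.

```python
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}
DECODE = "ACGT"

def kmer_to_int(kmer):
    """Pack a k-mer into an integer, 2 bits per base, first base most significant."""
    value = 0
    for base in kmer:
        value = (value << 2) | ENCODE[base]
    return value

def int_to_kmer(value, k):
    """Unpack an integer produced by kmer_to_int back into a k-mer of length k."""
    bases = []
    for _ in range(k):
        bases.append(DECODE[value & 3])
        value >>= 2
    return "".join(reversed(bases))

kmer = "ACGTACGTACGTACGTA"                  # 17 bases
offset = kmer_to_int(kmer)
print(offset, offset.bit_length() <= 34, int_to_kmer(offset, 17) == kmer)
print(4**17 == 2**34)                       # every 17-mer fits in 34 bits
```

A dense table with one byte of count per possible 17-mer needs 4**17 = 2**34 bytes, or 16 GiB, regardless of how much sequence you throw at it; that's the "efficient (if big)" trade-off.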
The de Bruijn graph approach is what Velvet, ABySS, and SOAPdenovo use, and it seems to work well.
So what's the problem?
Scaling. Scaling is the problem.
Well, that and the sequencing companies and the biologists.
Let me explain. Sequencing companies are producing newer and bigger and better machines, that produce more and more sequence, every week. The Illumina GA2 produces 10-100 Gb of sequence per run now. The HiSeq 2000 is going to produce even more enormous amounts of sequence as soon as we get one. And more, lots more, is on the way.
This wouldn't be a problem if biologists would just stick to the exciting old problems, like resequencing humans and doing transcriptomes etc. But noooo, biologists see these juicy new sequencers and think -- hey! I could sequence populations of organisms! Or, like, 30 new organisms at once! Or 30 transcriptomes at once! And it will be cheap! (And we'll have someone else do the bioinformatics, which is easy, right? Right?)
So the sequencing companies are producing newer and cheaper and faster sequencing machines, and the biologists are using them to tackle ever more exciting and novel and challenging biological questions, and ... guess what? Our existing tools and approaches don't scale very well.
For one very specific example, the de Bruijn graph approach breaks down completely if you are sequencing endlessly diverse populations, as we seem to be doing in metagenomics. If you have some high abundance organisms, and a lot more low abundance organisms, and you sequence the organism soup to some arbitrary level, the novel k-mers will swamp your assembler, and to no end -- because those k-mers are never going to assemble to anything big without more sequencing. In which case you've compounded your swamping problem in an attempt to solve your earlier problem.
Similar things happen with wild population sequencing, where you get new and diverse sequences every time you look at a new animal; humans, even with their relatively low diversity, are one fine example.
OK, so this is the problem to solve, and I think it's a really big problem. It's not decomposable so it can't be made to scale well, and we're already at the limit of our existing compute infrastructure for the data we already have. (See Terabase metagenomics -- the computational side and grim future for sequencing centers.) And as we try to inch the boundaries along, the sequencing companies are producing new and bigger machines to give us new and bigger amounts of data.
Are there any solutions? No really good ones, unfortunately. The solution du jour (see MetaHIT methods and my earlier blog posts) is to throw away low abundance data that you figure won't assemble, and/or subdivide the sequences by abundance, in the expectation that similar abundance sequences will come from the same original genome. These are basically approximation heuristics, hoping to reduce the data in such a way that the assembler can deal with it. The hope is that the assembler can do a not-terribly-bad-job of assembly based on the known structure of the population.
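The throw-away-low-abundance heuristic can be sketched in a few lines. This is only an illustration of the idea, using an ordinary in-memory counter; it is not the khmer interface or the MetaHIT pipeline, and the k, threshold, and toy reads are arbitrary assumptions.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer occurrence across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def filter_low_abundance(reads, k, min_count):
    """Keep only reads in which every k-mer occurs at least min_count times overall.
    (Reads shorter than k contain no k-mers and are kept.)"""
    counts = count_kmers(reads, k)
    return [r for r in reads
            if all(counts[r[i:i + k]] >= min_count
                   for i in range(len(r) - k + 1))]

# Three copies of a read from an "abundant" organism, one from a rare one.
reads = ["acgtacgtac"] * 3 + ["ttgcaagcta"]
kept = filter_low_abundance(reads, k=4, min_count=2)
print(len(reads), "reads in,", len(kept), "reads kept")
```

The rare organism's read is dropped even though it is error-free, which is the catch: the filter can't tell low-abundance truth from error.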
Moreover, the throwing-away-data solution won't scale very well; soon enough you'll be throwing away not just 90% of the data, but 99% of the data, just to get a tractable data set.
We are doomed, doomed I say! Clearly we should give up.
Anyway, this concludes part one of a series of blog posts on assembly. In part two, I plan to talk a bit about paired-end sequencing and repeat sequences.
--titus
p.s. An excellent assembly algorithm reference: Miller, Koren, and Sutton, Genomics, 2010.
posted at: 15:07 | path: /aug-10 | 4 comments