Tue, 09 Aug 2011
Assembling genomes with modern sequencing
As sequencing gets cheaper and cheaper, one would expect the answer for how to best sequence (and assemble!) any given genome would change. Most biologists assume something along these lines: everyone else has achieved some standard coverage (say 10x, or 100x) for their genome, so all we need to do is multiply that number times the size of my genome of interest, and then multiply that by the cost/bp, and voila! I will be able to have my very own genome sequence!
Naturally it's a bit more complicated than that, for several reasons. To begin with, the length of the reads matters quite a bit. If you're reading off a 1 Gb eukaryotic genome in chunks of 100 bases, you're going to have trouble assembling the darn thing. First, you have to worry about complex repeats, which (in the context of assembly) are just plain evil, because they create connectivity structures that simply can't be resolved without additional information. Second, you need to think about sequencing bias, such as GC- and AT-rich regions -- most sequencers don't do that well on GC-rich regions, which are plentiful in big eukaryotic genomes. And third, normal sampling variation in shotgun coverage will screw you, on top of all of this, if you don't think about it.
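To put rough numbers on the naive calculation and on the sampling-variation point, here is a minimal sketch. It assumes an idealized model in which reads land uniformly at random, so per-base depth is approximately Poisson with mean equal to the coverage; that ignores repeats and sequencing bias entirely, so treat it as a best case. The function names and the cost-per-base argument are made up for illustration.

```python
import math


def naive_sequencing_cost(genome_size_bp, coverage, cost_per_bp):
    """Back-of-the-envelope cost: bases needed times price per base."""
    return genome_size_bp * coverage * cost_per_bp


def prob_depth_below(coverage, min_depth):
    """P(a given base is sampled fewer than min_depth times).

    Assumes per-base depth is Poisson with mean `coverage`
    (uniform random sampling, no bias); this is an idealization.
    """
    return sum(
        math.exp(-coverage) * coverage ** k / math.factorial(k)
        for k in range(min_depth)
    )


def expected_bases_below(genome_size_bp, coverage, min_depth=1):
    """Expected number of bases with depth below min_depth under that model."""
    return genome_size_bp * prob_depth_below(coverage, min_depth)


if __name__ == "__main__":
    genome = 1_000_000_000  # a 1 Gb genome, as in the example above
    for cov in (5, 10, 30):
        missed = expected_bases_below(genome, cov)
        thin = expected_bases_below(genome, cov, min_depth=5)
        print(f"{cov}x: ~{missed:,.0f} bases unsampled, ~{thin:,.0f} below 5x")
```

Even in this best case, 10x coverage of a 1 Gb genome leaves on the order of 45,000 bases with no reads at all, and far more with too few reads to assemble confidently. Real data will do worse, because of the bias and repeat problems described above.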
So, what is the optimal sequencing strategy, then?
There's been some interesting discussion on the assemblathon mailing list about all of this, which, for the most part, I'll be paraphrasing and interpreting: the list archives are closed and the list policy about citing people is that I need to ask them for individual permission, and that's too much work :). If you're interested in the source messages, I recommend subscribing yourself and looking through the archives for messages from June 2011; if they open up the archives, I'll link directly to some of the more interesting messages.
A key component of any sequencing strategy discussion nowadays is that sequencing has become very commercial. While this drives down costs (pretty dramatically!), you also can't trust a damn thing that sequencing companies say, because the market is very competitive and there's very little percentage in straight-up honesty, much less full disclosure. (Paranoid much? Yeah, buy me beer sometime.) Moreover, there are several competing sequencing centers -- primarily the Broad Institute and the Beijing Genome Institute, as well as the Joint Genome Institute, Sanger, and St. Louis, and probably another five that I'm missing -- that all appear to have adopted different policies with respect to sequencing genomes. I don't really know what they are in detail, but (for example) Broad has a stereotyped sequencing strategy for which it has written its own software suite (see ALLPATHS-LG), and you can read the details in the PNAS paper. The bottom line is you need to talk to people who have experience with actual sequence, and not be overly trusting of either sequencing centers or company reps.
Another key component of any sequencing strategy discussion is the software being used to assemble. Some centers have their own assemblers (BGI has SOAPdenovo, Broad has ALLPATHS-LG), but there are literally dozens of assemblers out there. The assemblers can broadly be broken down into about four different types: overlap-layout-consensus, de Bruijn graph, greedy local, and "other". I'm most familiar with de Bruijn graph assemblers, since that's what I'm working with here at MSU, but there are advantages and disadvantages to the various kinds. Maybe more on that later. But the bottom line here is that there are many brilliant, passionate, opinionated people who have written their own assembler, and will swear by all that is holy that it is the best one. How do you choose?
A third key component of any sequencing strategy discussion is the genome itself. Mihai Pop's group just published a veddy interesting article on prokaryotic assembly (see Wetzel et al., 2011) in which they argue that the optimal sequencing strategy needs to be dynamically adjusted to the repeat structure of the genome: that is, you need to do a first sequencing run; analyze it for repeat structures; and then plan out your next rounds of sequencing based on that information. While I am always suspicious of plans that require intelligent thought (slow! expen$ive!) to be inserted into sequencing pipelines (fast! high throughput!), I think they make a pretty good argument -- and that's just for prokaryotic genomes, which are simple compared to eukaryotic genomes... for eukaryotic genomes, you also have to worry about heterozygosity (how much internal variation there is between the two haploid genomes you're sequencing). So how can you strategize to deal with your genome?
But let's back up. What are we doing, again?
Sequencing genomes is like this:
Long, not-terribly-random strings of (physical) DNA, O(10^7-10^10) in length.
Goal: determine full sequence and connectivity of strings of DNA.
Process: fragment into lots of bits, sequence in from both ends of each bit. Use overlaps, size of bits ("insert size"), to computationally reassemble.
(You can read an earlier blog post about why this is a hard problem here, or go read the UMD CBCB assembly primer here.)
The challenge, succinctly put, is this: in the face of uneven coverage and repetitive subsequences, devise the optimal coverage and range of insert sizes so that you can (a) sample most of the genome sufficiently and (b) resolve most repetitive regions by looking at pairs of ends. Do so (c) as cheaply as possible.
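Part (b) can be made a little more concrete with a toy calculation. The sketch below counts how many paired-end fragments would be expected to bridge a single repeat, under assumptions that are mine and not from any of the sources mentioned here: fragments start uniformly at random, "insert size" means the full fragment length including both reads, a pair only counts as bridging if each read lies entirely in unique sequence on opposite sides of the repeat, and the repeat sits far from the ends of the genome. All names are hypothetical.

```python
def bridging_start_positions(insert_size, read_len, repeat_len):
    """Number of fragment start positions whose two end reads flank the repeat.

    A fragment starting at s has reads at [s, s+read_len) and
    [s+insert_size-read_len, s+insert_size). Both reads fall fully outside
    a repeat at [r, r+repeat_len), on opposite sides, when
    r + repeat_len - insert_size + read_len <= s <= r - read_len.
    """
    return max(0, insert_size - repeat_len - 2 * read_len + 1)


def expected_bridging_pairs(genome_size, n_pairs, insert_size, read_len, repeat_len):
    """Expected count of bridging pairs if fragment starts are uniform."""
    starts = bridging_start_positions(insert_size, read_len, repeat_len)
    total_starts = genome_size - insert_size + 1
    return n_pairs * starts / total_starts


if __name__ == "__main__":
    genome_size = 100_000_000   # illustrative 100 Mb genome
    n_pairs = 5_000_000         # illustrative library size
    for insert in (600, 3_000, 5_000, 10_000):
        n = expected_bridging_pairs(genome_size, n_pairs, insert,
                                    read_len=100, repeat_len=2_000)
        print(f"insert {insert:>6} bp: ~{n:.1f} pairs bridge a 2 kb repeat")
```

The point of the toy model is just the threshold: a library whose inserts are shorter than the repeat plus two read lengths contributes nothing toward resolving that repeat, no matter how much of it you sequence.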
OK, so what are the parameters you can twiddle?
It really boils down to these choices:
Sequencing technology: 454 or Illumina are the main production machines these days, although I hear things about PacBio, Ion Torrent, and ABI SOLiD. 454 is much more expensive per base, but gives longer reads (500bp +); Illumina is (much) cheaper per base, but the reads are annoyingly short (100-150 bp). With Illumina you can get ~600 bp inserts easily, larger inserts (3kb, 5kb, 10kb) with more difficulty. Not sure about 454.
Coverage: how much money do you want to spend, on what sequencing technology?
Insert sizes: larger inserts are really useful for bridging repeats, but also much more expensive.
And... I think that's about it. Or is it?
Well, you need to ask two more questions: can your assembler of choice take advantage of mixed read lengths, with mixed error models from different technologies, and/or various insert sizes? And can your sequencing center actually make all the different technologies work reliably?
(As I keep telling my students, if it were easy they wouldn't need brilliant people like us to work on it, now would they?)
When I get swamped with these kinds of questions, I usually try to retreat back into my reductionist hidey hole to clear my head. So let's back up again. What are the fundamental issues?
We can't do much about sequencing bias or heterozygosity, except to say that more coverage is generally going to make both biases and internal sequence variation stand out more reliably from random error. If we actually want to assemble our genome, we also can't do much about improving current assemblers, and it's unclear how to evaluate assemblers anyway, and most of them don't appear to do a great job on very heterogeneous sequence types (i.e. from multiple types of sequencers) - anyway, these are the questions the assemblathon is asking, and they're doing a good job; just read the paper when it comes out. And we don't have much control over whether or not our sequencing center screws up.
So we're left with trying to decide on how much 454, how much Illumina, and what insert sizes. (Can you hear the shrieks of pain from sequencing and assembly aficionados as I ruthlessly strip all of the subtleties from the argument? Fun!)
For insert size, I like to point people to these two references:
Whiteford et al., Nucleic Acids Res, 2005 http://nar.oxfordjournals.org/content/33/19/e171.full
Butler et al., Genome Res, 2008 http://genome.cshlp.org/content/18/5/810.full
which make the nice point that there are many repeat structures that you simply cannot resolve with single-ended reads -- you need paired-end reads to do a good job of assembly. These two papers have recently been joined by a third, the Wetzel et al. paper above, which suggests that there are particular (and surprisingly frequent) repeat structures that cannot be resolved except by a very specific insert size. But barring advance knowledge of repeat structure, I would argue that a nice range of inserts, from 3k to 5k to 10k, should give you decent results. We have that for a parasitic nematode project in which I'm involved, and it's given us decent scaffold sizes.
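To see why a repeat that is longer than your reads is a connectivity problem rather than just a coverage problem, here is a toy de Bruijn-style sketch. It is not how any real assembler is implemented: it builds the graph straight from a made-up, error-free, single-stranded genome rather than from reads, ignores reverse complements, and uses k-mers as nodes with edges between consecutive k-mers. The sequence and function names are invented for illustration.

```python
from collections import defaultdict


def kmer_graph(seq, k):
    """Map each k-mer to the set of k-mers that follow it in seq (overlap k-1)."""
    graph = defaultdict(set)
    for i in range(len(seq) - k):
        graph[seq[i:i + k]].add(seq[i + 1:i + k + 1])
    return graph


def branching_nodes(graph):
    """k-mers with more than one possible successor: ambiguous extensions."""
    return sorted(node for node, nxt in graph.items() if len(nxt) > 1)


# Toy genome: unique + REPEAT + unique + REPEAT + unique (REPEAT is 7 bases).
REPEAT = "GATTACA"
GENOME = "CCCT" + REPEAT + "GGGC" + REPEAT + "TTTG"

if __name__ == "__main__":
    for k in (4, 8):
        print(f"k={k}: branching k-mers = {branching_nodes(kmer_graph(GENOME, k))}")
```

With k shorter than the repeat, the graph forks at the end of the repeat and nothing in the graph says which exit goes with which entrance. With k longer than the repeat the fork disappears. Since reads (and so k) are short, the extra information has to come from somewhere else, which is where paired ends with suitable insert sizes come in.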
With 454 vs Illumina, I am skeptical that 454 is a good expenditure of money at this point. The number of bases is so astonishingly low compared to what Illumina is outputting (~1m vs ~1bn for the same amount of money, I think? At any rate, at least 100x) that you really need to justify any 454 expenditure. That having been said, I may be so used to working with crappy genome assemblies (buy me beer, hear me weep) that I'm ignoring how much better they would be with ~10x 454 coverage. Certainly Greg Dick's group at U of M has shown me pretty good evidence that 454 sequences things that Illumina won't touch, in metagenomic data. So I can't give you much more than my experience that Illumina will get you ~80% of the way to a decent genome assembly -- which is something many people would love to have.
Is there an elephant in the room, and, if so, what is it? Well, this touches heavily on our lab's research, but I think that sequencing biases are screwing up the assembly game far more than people think. Right now assemblers have a bunch of poorly understood heuristics that address sequencer-specific bias, and our experience with these in metagenomic sequencing suggests that these artifacts and heuristics are a significant source of misassembly. More on that ... later.
I'm really at a loss about how to conclude any discussion of sequencing strategy. It's ridiculously complicated, comes down to a lot of guessing about what problems you're likely to run into, and involves an extremely rapidly changing technology suite. Getting a comprehensive answer out of anyone is hard... and won't get any easier for a while.
That having been said, I'd appreciate pointers to blog posts and open discussions of these issues on mailing lists. Having taught (or tried to teach) some biologists in this area recently, as part of my NGS course, I think actually providing these discussions could be incredibly valuable and could raise the level of discourse a fair bit.
--titus
posted at: 17:10 | path: /aug-11 | 4 comments
Sun, 07 Aug 2011
Why the Cloud does not solve the computational scaling problem in biology
There's been a lot of hoopla in the last year or so about the fact that our ability to generate sequence has scaled faster than Moore's Law over the last few years, and the attendant challenges of scaling analysis capacity; see Figure 1a and 1b, this reddit discussion, and also my "the sky is falling!" blog post.
There's also been some backlash -- it's gotten to the point where showing any of the various graphs is greeted with derision, at least judging by the talk-associated Twitter feed.
Figure 1a. DNA sequencing costs, from http://www.genome.gov/sequencingcosts/
Figure 1b. Sequencing costs vs hard disk costs. Slide courtesy of Lincoln Stein.
From the discussions I've seen, I think people still don't get -- or at least don't talk about -- the implications of this scaling behavior. In particular, I am surprised to hear the cloud (the cloud! the cloud!) touted as The Solution, since it's clearly not a solution to the actual scaling problem. (Also see: Wikipedia on cloud computing.)
To see why, consider the following model. Take two log-linear plots, one for cost per unit of compute power (CPU cycles, disk space, RAM, what have you), and one for cost per unit of sequence ($$ per bp of DNA). Now suppose that sequence cost is decreasing faster than compute cost, so you have two nice, diverging straight lines when you plot these trends over time on a log-linear plot (see Figure 2).
Figure 2. A simple model of the (exponential) decrease in compute costs vs (exponential) decrease in sequencing data costs, against time.
Suppose we're interested in how much money to allocate to sequencing, vs how much money to allocate to compute -- the heart of the problem. How do these trends behave? One way to examine them is to look at the ratio of the data points.
Figure 3. The ratio of compute power to data over time, under the model in the previous figure.
As you'd expect, the ratio of compute power to data is also log-linear (Figure 3) -- it's just the difference between the two lines in Figure 2. Straight lines on log-linear plots, however, are in reality exponential -- see Figure 4! This is a linear-scale plot of compute costs relative to data costs -- and as you can see, compute costs end up dominating.
Figure 4. Ratio of compute power to data, over time, on a linear plot.
With this model, then, for the same dollar value of data, your relative compute costs will increase by a factor of 1000 over 10 years. This is true whether or not you're using the cloud! While your absolute costs may go up or down depending on infrastructure investments, Amazon's pricepoint, etc., the fundamental scaling behavior doesn't change. It doesn't much matter if Amazon is 2x cheaper than your HPC -- check out Figures 5a, b, and c if you need graphical confirmation of the math.
Figure 5a,b,c: Scaling behavior isn't affected by linearly lower costs.
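If you'd rather check the arithmetic than squint at figures, here is a minimal sketch of the same kind of model. The halving rates are assumptions picked only so that the numbers land near the ~1000x-in-10-years figure above (compute cost halving every 2 years, sequencing cost halving every 8 months); they are not the values behind the plots, and the function names are made up.

```python
# Assumed rates, chosen for illustration only (not taken from the figures):
COMPUTE_HALVINGS_PER_YEAR = 0.5      # compute cost halves every 2 years
SEQUENCING_HALVINGS_PER_YEAR = 1.5   # sequencing cost halves every 8 months


def cost(t_years, halvings_per_year, scale=1.0):
    """Exponentially decaying unit cost: a straight line on a log-linear plot."""
    return scale * 2.0 ** (-halvings_per_year * t_years)


def compute_to_data_ratio(t_years, compute_scale=1.0):
    """Compute cost relative to data cost at time t.

    compute_scale models a constant-factor discount on compute,
    e.g. 0.5 for a provider that is always 2x cheaper.
    """
    compute = cost(t_years, COMPUTE_HALVINGS_PER_YEAR, compute_scale)
    data = cost(t_years, SEQUENCING_HALVINGS_PER_YEAR)
    return compute / data


def growth_factor(years, compute_scale=1.0):
    """How much the ratio grows between year 0 and the given year."""
    return (compute_to_data_ratio(years, compute_scale)
            / compute_to_data_ratio(0, compute_scale))


if __name__ == "__main__":
    for scale in (1.0, 0.5):
        print(f"compute_scale={scale}: ratio grows "
              f"{growth_factor(10, scale):.0f}x over 10 years")
```

A constant discount on compute shifts the whole curve down by that constant, but the growth factor over any time span is unchanged, because the discount cancels when you compare the ratio at two different times.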
The bottom line is this: when your data cost is decreasing faster than your hardware cost, the long-term solution cannot be to buy, rent, borrow, beg, or steal more hardware. The solution must lie in software and algorithms.
People who claim that cloud computing is going to provide an answer to the scaling issue with sequence, then, must be operating with some additional assumptions. Maybe they think the curves are shifted relative to one another, so that even 1000x costs are not a big deal - although Figure 1 sort of argues against that. Like me, maybe they've heard that hard disks are about to start scaling way, way better -- if so, awesome! That might change the curves for data storage, if not analysis. Perhaps their research depends on using only a bounded amount of sequence -- e.g. single-genome sequencing, for which you can stop generating data at a certain point. Or perhaps they're proposing to use algorithms that scale sub-linearly with the amount of data they're applied to (although I don't know of any). Or perhaps they're planning for the shift in Moore's Law behavior that will come when Amazon and other cloud computing providers build self-replicating compute clusters on the moon (hello, exo-atmospheric computing!). Whatever the plan, it would be interesting to hear their assumptions explained.
I think one likely answer to the Big Data conundrum in biology is that we'll come up with cleverer and cleverer approaches for quickly throwing away data that is unlikely to be of any use. Assuming these algorithms are linear in their application to data, but have smaller constants in front of their big-O, this will at least help stem the tide. (It will also, unfortunately, generate more and nastier biases in the results...) But I don't have any answers for what will happen in the medium term if sequencing continues to scale as it does.
It's also worth noting that de novo assembly (my current focus...) remains one of the biggest challenges. It requires gobs of the most expensive computational resource (RAM, which is not scaling as fast as disk and CPU), and there are no good solutions on the horizon for making it scale faster. Neither mRNAseq nor metagenomics is a well-bounded problem (you always want more sequence!), and assembly will remain a critical approach for many people for many years. Moreover, cloud assembly approaches like Contrail are (sooner or later) doomed by the same logic as above. But it's a problem we need to solve! As I said at PyCon, "Life's too short to tackle the easy problems -- come to academia!".
--titus
p.s. If you want to play with the curves yourself, here's a Google Spreadsheet, and you can grab a straight CSV file here.
posted at: 12:47 | path: /aug-11 | 8 comments
On mentoring
One of the most important jobs a professor has is to pay it forward: that is, to teach, train, mentor, support, and open up opportunities for their students and postdocs. It's a job that is undervalued by those who focus on the short term -- the administrators and review committees that judge us by the money we bring in and the papers we publish. It's a misunderstood job, as well; the goal of a good mentor is not to mold their students in their own image, but to push them; to expand their horizons, not to contract them around the mentor's own views. And it's probably one of the two or three most rewarding parts of the job (surpassed only by the fun of actually doing the research!)
I came to academia to do science: to research biology, and to solve problems that are as yet unsolved. And my goal is to become a good researcher, and to solve really hard problems well; hopefully that will be part of whatever legacy I have. But it's increasingly clear to me that the best, most lasting legacy possible is to train the next generations of scientists in doing good science, be it in industry or academia. For example, my father trained nearly 100 PhD students during his research career, and always felt that his training record was one of his most impactful contributions -- many of those 100 students are professors today, doing their own training. Pay it forward, indeed!
Recently, I realized that if I drop dead tomorrow (or don't get tenure in three years -- same thing, right?) I would still have touched a number of people's lives in really positive ways. For example, I received an e-mail from one student who participated in GSoC, telling me that she felt she had learned an immense amount about self-sufficiency and the value of making an effort from that experience. I hired another student to do lab grunt work; she ended up liking science, switched majors, graduated, and has now received several national fellowships and is going to graduate school. I don't take credit for much past hiring her, but if I hadn't hired her, she would not have had the opportunity to show how good she is. Blogging and writing tutorials counts, too: at PyCon, 2-3 people a year come up to me and tell me how much they appreciate one or another of my blog posts or tutorials. Even class teaching can be impactful -- at graduation recently, an entire group of CS students told me how much they'd enjoyed my class on Web dev, with one particular student introducing me to her parents as "one of the good ones".
It's hard to overstate how fantastic it is to watch students grow and change.
Despite the joys of transforming lives, many professors only want to take energetic, intelligent, well-spoken, well-trained students. Yet these students are the ones that already have opportunities, and frankly need less mentoring than others -- they already "get it", and really just need experience. Providing that experience is valuable, and a key part of training. But the same professors that jump at the 4.0 GPA student who speaks English natively and has tons of energy will turn down opportunities to take students from non-research intensive institutions, or unfocused students who come from non-academic backgrounds, or people who haven't the faintest idea what science is. And I think they do themselves and the students a disservice. These students simply have never seen the same opportunities that many students at (e.g.) MSU take for granted; the difference you could make in their lives dwarfs the impact you'd make on most better prepared students.
It's worth remembering that at some point we all start as wet-behind-the-ears youngsters that have never confronted a problem without an answer. Someone took a chance on us -- maybe it was easier to get that chance for those of us from a top-ranked institution, with an academic background (...me), or maybe it was a random "hey, do you want to work for me over the summer?" chance (...lots of people), or maybe it was an organized program to introduce underrepresented minorities to research (...lots of people). Now that you're a grad student, or a postdoc, or a professor, or an open source hacker with commit privileges to a dozen projects, take that chance. It's a lot of work but it can be more rewarding than pretty much anything else.
If I have a point with this blog post, it's this: one of the privileges of academia, and mentoring programs like GSoC, is being put in a position where you can touch many people's lives, as a mentor and a teacher. Do so! Take chances on people! Lay out some expectations and see who rises to meet them, and then chase down those that don't understand what to do. Don't start out with the expectation that you must be, or will be, rewarded in kind -- many students take years to be productive, may end up working on something completely different with someone else, and may never even say thanks. But that's OK, if you don't treat teaching and undergraduate research as a way to get more work done ('cause honestly, it's not), but rather as an opportunity to introduce the joys of your work to people who may have no idea that such awesome jobs exist. Pay it forward.
--titus
posted at: 11:33 | path: /aug-11 | 3 comments