Sun, 07 Aug 2011
Why the Cloud does not solve the computational scaling problem in biology
There's been a lot of hooplah in the last year or so about the fact that our ability to generate sequence has scaled faster than Moore's Law over the last few years, and the attendant challenges of scaling analysis capacity; see Figure 1a and 1b, this reddit discussion, and also my the sky is falling! blog post.
There's also been been some backlash -- it's gotten to the point where showing any of the various graphs is greeted with derision, at least judging by the talk-associated Twitter feed.
Figure 1a. DNA sequencing costs, from http://www.genome.gov/sequencingcosts/
Figure 1b. Sequencing costs vs hard disk costs. Slide courtesy of Lincoln Stein.
From by the discussions I've seen, I think people still don't get -- or at least don't talk about -- the implications of this scaling behavior. In particular, I am surprised to hear the cloud (the cloud! the cloud!) touted as The Solution, since it's clearly not a solution to the actual scaling problem. (Also see: Wikipedia on cloud computing.)
To see why, consider the following model. Take two log-linear plots, one for cost per unit of compute power (CPU cycles, disk space, RAM, what have you), and one for cost per unit of sequence ($$ per bp of DNA). Now suppose that sequence cost is decreasing faster than compute cost, so you have two nice, diverging linear lines when you plot these trends over time on a log-linear plot (see Figure 2).
Figure 2. A simple model of the (exponential) decrease in compute costs vs (exponential) decrease sequencing data costs, against time.
Suppose we're interested in how much money to allocate to sequencing, vs how much money to allocate to compute -- the heart of the problem. How do these trends behave? One way to examine them is to look at the ratio of the data points.
Figure 3. The ratio of compute power to data over time, under the model in the previous figure.
As you'd expect, the ratio of compute power to data is also log-linear (Figure 3) -- it's just the difference between the two lines in Figure 2. Straight lines on log-linear plots, however, are in reality exponential -- see Figure 4! This is a linear-scale plot of compute costs relative to data costs -- and as you can see, compute costs end up dominating.
Figure 4. Ratio of compute power to data, over time, on a linear plot.
With this model, then, for the same dollar value of data, your relative compute costs will increase by a factor of 1000 over 10 years. This is true whether or not you're using the cloud! While your absolute costs may go up or down depending on infrastructure investments, Amazon's pricepoint, etc., the fundamental scaling behavior doesn't change. It doesn't much matter if Amazon is 2x cheaper than your HPC -- check out Figures 5a, b, and c if you need graphical confirmation of the math.
Figure 5a,b,c: Scaling behavior isn't affected by linearly lower costs.
The bottom line is this: when your data cost is decreasing faster than your hardware cost, the long-term solution cannot be to buy, rent, borrow, beg, or steal more hardware. The solution must lie in software and algorithms.
People who claim that cloud computing is going to provide an answer to the scaling issue with sequence, then, must be operating with some additional assumptions. Maybe they think the curves are shifted relative to one another, so that even 1000x costs are not a big deal - although figure 1 sort of argues against that. Like me, maybe they've heard that hard disks are about to start scaling way, way better -- if so, awesome! That might change the curves for data storage, if not analysis. Perhaps their research depends on using only a bounded amount of sequence -- e.g. single-genome sequencing, for which you can stop generating data at a certain point. Or perhaps they're proposing to use algorithms that scale sub-linearly with the amount of data they're applied to (although I don't know of any). Or perhaps they're planning for the shift in Moore's Law behavior that will come when that Amazon and other cloud computing providers build self-replicating compute clusters on the moon (hello, exo-atmospheric computing!) Whatever the plan, it would be interesting to hear their assumptions explained.
I think one likely answer to the Big Data conundrum in biology is that we'll come up with cleverer and cleverer approaches for quickly throwing away data that is unlikely to be of any use. Assuming these algorithms are linear in their application to data, but have smaller constants in front of their big-O, this will at least help stem the tide. (It will also, unfortunately, generate more and nastier biases in the results...) But I don't have any answers for what will happen in the medium term if sequencing continues to scale as it does.
It's also worth noting that de novo assembly (my current focus...) remains one of the biggest challenges. It requires gobs of the most expensive computational resource (RAM, which is not scaling as fast as disk and CPU), and there are no good solutions on the horizon for making it scale faster. Neither mRNAseq nor metagenomics are well-bounded problems (you always want more sequence!), and assembly will remain a critical approach for many people for many years. Moreover, cloud assembly approaches like Contrail are (sooner or later) doomed by the same logic as above. But it's a problem we need to solve! As I said at PyCon, "Life's too short to tackle the easy problems -- come to academia!".
--titus
p.s. If you want to play with the curves yourself, here's a Google Spreadsheet, and you can grab a straight CSV file here.
posted at: 12:47 | path: /aug-11 | 8 comments
Tue, 24 Nov 2009
Lazyweb query: CloudStore (or KosmosFS)
Does anyone have any experience with CloudStore, formerly known as KosmosFS? From http://en.wikipedia.org/wiki/CloudStore:
CloudStore (KFS, previously Kosmosfs) is Kosmix's C++ implementation of Google File System. ... CloudStore supports incremental scalability, replication, checksumming for data integrity, client side fail-over and access from C++, Java and Python.
The project site is here: http://kosmosfs.sourceforge.net/
I'd be interested in comments on general usability, quality, and "feel"...
thanks!
--titus
posted at: 20:14 | path: /nov-09 | 4 comments