Wed, 04 Jan 2012
What I'm REALLY thinking about when I use your bioinformatics software
If you're like me, we pretend to care about the science in bioinformatics software. But what we really do is try to find reasons not to outright loathe the software -- because, lud knows, there are usually plenty of reasons to hate it.
In no particular order, here are the top 10 things I hate about your bioinformatics software. You know who I'm talking about.
- You posted it on SourceForge (and so I can't download the damn thing using a simple URL).
- You're not using version control (and hence are not a scientist).
- You put _ in the damned file name unnecessarily (that requires a shift key on my keyboard).
- fizbam-0.9.3.tar.gz either untars into the current directory, OR a directory named 'fizbam'. Alternatively, you named it fizbam-0.9.3-2011010101010101010101010101020.tar.gz and it untars into THAT monster of a name (and my ls goes off the screen).
- You have no README, or, if you do, it's a one-liner that refers to a URL (didn't I already download your damned software?) OR 5a. Your README file is in HTML (less is better than lynx, dontcha know?)
- Running 'make' rebuilds everything from scratch every time you run it (seriously?)
- There are neither tests nor examples; or, if there are, I can't run 'em, and even if they do run, I have no idea if the results are correct.
- The output is in some weird format and/or location (wait, I have to do a find to find the last file written, and then guess as to its format?)
- The command line options are poorly labeled and described, use random abbreviations, and/or are sensitive to order (unlike every good command-line parsing library written).
- You CaMEl-CaSED your software name so that not even tab completion can figure it out (program names should be all lower-case, as Darwin intended).
---
Post your top ten and send me the URLs... :)
--titus
posted at: 21:57 | path: /jan-12 | 10 comments
Tue, 20 Dec 2011
Paper draft: Scaling metagenome sequence assembly with probabilistic de Bruijn graphs
(updated to point to http://arxiv.org/).
Authors: Jason Pell, Arend Hintze, Rosangela Canino-Koning, Adina Howe, James M. Tiedje, C. Titus Brown
Abstract:
The memory requirements for de novo assembly of short-read shotgun sequencing data from complex microbial populations are an increasingly large practical barrier to environmental studies. Here we introduce a memory-efficient graph representation with which we can analyze the k-mer connectivity of metagenomic samples, allowing us to reduce the size of the de novo assembly process for metagenomes with a "divide and conquer" algorithm. This graph representation is based on a probabilistic data structure, a Bloom filter, that allows us to store assembly graphs in as little as 4 bits per k-mer. We use this approach to achieve a 20-fold decrease in memory for the assembly of a soil metagenome sample.
The paper is available on arXiv here: http://arxiv.org/abs/1112.4193. Comments are welcome! We're planning to submit it to PNAS later this week.
I'll write a longer blog post about it soon.
--titus
posted at: 09:10 | path: /dec-11 | 0 comments
Sun, 11 Dec 2011
Data Intensive Science, and Workflows
I'm writing this on my way back from Stockholm, where I attended a workshop on the 4th Paradigm. This is the idea (so named by Jim Gray, I gather?) that data-intensive science is a distinct paradigm from the first three paradigms of scientific investigation -- theory, experiment, and simulation. I was invited to attend as a token biologist -- someone in biology who works with large scale data, and thinks about large scale data, but isn't necessarily only devoted to dealing with large scale data.
The workshop was pretty interesting. It was run by Microsoft, who invited me & paid my way out there. The idea was to identify areas of opportunity in which Microsoft could make investments to help out scientists and develop new perspectives on the future of eScience. To do that, we played a game that I'll call the "anchor game", where we divvied up into groups to discuss the blocks to our work that stemmed from algorithms and tools, data, process and workflows, social/organizational aspects. In each group we put together sticky notes with our "complaints" and then ranked them by how big of an anchor they were on us -- "deep" sticky notes held us back more than shallow sticky notes. We then reorganized by discipline, and put together an end-to-end workflow that we hoped in 5 years would be possible, and then finally we looked for short- and medium-term projects that would help get us there.
The big surprise for me in all of this was that it turns out I'm most interested in workflows and process! All of my sticky notes had the same theme: it wasn't tools, or data management, or social aspects that were causing me problems, but rather the development of appropriate workflows for scientific investigation. Very weird, and not what I would have predicted from the outset.
Two questions came up for me during the workshop:
Why don't people use workflows in bioinformatics?
The first question that comes to mind is, why doesn't anyone I know use a formal workflow engine in bioinformatics? I know they exist (Taverna, for one, has a bunch of bioinformatics workflows); I'm reasonably sure they would be useful; but there seems to be some block against using them!
What would the ideal workflow situation be?
During the anchor game, our group (which consisted of two biologists, myself and Hugh Shanahan; a physicist, Geoffrey Fox; and a few computer scientists, including Alex Wade) came up with an idea for a specific tool. The tool would be a bridge between Datascope for biologists and a workflow engine. The essential idea is to combine data exploration with audit trail recording, which could then be hardened into a workflow template and re-used.
---
Thinking about the process I usually use when working on a new problem, it tends to consist of all these activites mixed together:
- evaluation of data quality, and application of "computational" controls
- exploration of various data manipulation steps, looking for statistical signal
- solidifying of various subcomponents of the data manipulation steps into scripted actions
- deployment of the entire thing on multiple data sets
Now, I'm never quite done -- new data types and questions always come up, and there are always components to tweak. But portions of the workflow become pretty solid by the time I'm halfway done with the initial project, and the evaluation of data quality accretes more steps but rarely loses one. So it could be quite useful to be able to take a step back and say, "yeah, these steps? wrap 'em up and put 'em into a workflow, I'm done." Except that I also want to be able to edit and change them in the future. And I'd like to be able to post the results along with the precise steps taken to generate them. And (as long as I'm greedy) I would like to work at the command line, while I know that others would like to be able to work in a notebook or graphical format. And I'd like to be able to "scale out" the computation by bringing the compute to the data.
For all of this I need three things: I need workflow agility, I need workflow versioning, and I need workflow tracking. And this all needs to sit on top of a workflow component model that lets me run the components of the workflow wherever the data is.
I'm guessing no tool out there does this, although I know other people are thinking this way, so maybe I'm wrong. The Microsoft folk didn't know of any, though, and they seemed pretty well informed in this area :).
The devil's choice I personally make in all of this is to go for workflow agility, and ignore versioning and tracking and the component model, by scripting the hell out of things. But this is getting old, and as I get older and have to teach my wicked ways to grad students and postdocs, the lack of versioning and tracking and easy scaling out gets more and more obvious. And now that I'm trying to actually teach computational science to biologists, it's simply appallingly difficult to convey this stuff in a sensible way. So I'm looking for something better.
One particularly intriguing type of software I've been looking at recently is the "interactive Web notebook" -- ipython notebook and sagenb. These are essentially Mathematica or matlab-style notebook packages that work with IPython or Sage, scientific computing and mathematical computing systems in Python. They let you run Python code interactively, and colocate it with its output (including graphics); the notebooks can then be saved and loaded and re-run. I'm thinking about working one or the other into my class, since it would let me move away from command-line dependence a bit. (Command lines are great, but throwing them at biologists, along with Python programming, data analysis, and program execution all together, seems a bit cruel. And not that productive.)
It would also be great to have cloud-enabled workflow components. As I embark on more and more sequence analysis, there are only about a dozen "things" we actually do, but mixed and matched. These things could be hardened, parameterized into components, and placed behind an RPC interface that would give us a standard way to execute them. Combined with a suitable data abstraction layer, I could run the components from location A on data in location B in a semi-transparent way, and also record and track their use in a variety of ways. Given a suitably rich set of components and use cases, I could imagine that these components and their interactions could be executed from multiple workflow engines, and with usefully interesting GUIs. I know Domain Specific Languages are already passe, but a DSL might be a good way to find the right division between subcomponents.
I'd be interested in hearing about such things that may already exist. I'm aware of Galaxy, but I think the components in Galaxy are not quite written at the right level of abstraction for me; Galaxy is also more focused on the GUI than I want. I don't know anything about Taverna, so I'm going to look into that a bit more. And, inevitably, we'll be writing some of our own software in this area, too.
Overall, I'm really interested in workflow approaches that let me transition seemlessly between discovery science and "firing for effect" for hypothesis-driven science.
A few more specific thoughts:
In the area of metagenomics (one of my research focuses at the moment), it would be great to see img/m, camera, and MG-RAST move towards a "broken out" workflow that lets semi-sophisticated computational users (hi mom!) run their stuff on the Amazon Cloud and on private HPCs or clouds. While I appreciate hosted services, there are many drawbacks to them, and I'd love to get my hands on the guts of those services. (I'm sure the MG-RAST folk would like me to note that they are moving towards making their pipeline more usable outside of Argonne: so noted.)
In the comments to my post on Four reasons I won't use your data analysis pipeline, Andrew Davison reminds me of VisTrails, which some people at the MS workshop recommended.
I met David De Roure from myExperiment at the MS workshop. To quote, "myExperiment makes it easy to find, use, and share scientific workflows and other Research Objects, and to build communities."
David put me in touch with Carole Goble who is involved with Taverna. Something to look at.
In the cloud computing workshop I organized at the Planet and Animal Genome conference this January, I will get a chance to buttonhole one of the Galaxy Cloud developers. I hope to make the most of this opportunity ;).
It'd be interesting to do some social science research on what difficulties users encounter when they attempt to use workflow engines. A lot of money goes into developing them, apparently, but at least in bioinformatics they are not widely used. Why? This is sort of in line with Greg Wilson's Software Carpentry and the wonderfully named blog It will never work in theory: rather than guessing randomly at what technical directions need to be pursued, why not study it empirically? It is increasingly obvious to me that improving computational science productivity has more to do with lowering learning barriers and changing other societal or cultural issues than with a simple lack of technology, and figuring out how (and if) appropriate technology could be integrated with the right incentives and teaching strategy is pretty important.
--titus
p.s. Special thanks to Kenji Takeda and Tony Hey for inviting me to the workshop, and especially for paying my way. 'twas really interesting!
posted at: 07:00 | path: /dec-11 | 8 comments
Tue, 09 Aug 2011
Assembling genomes with modern sequencing
As sequencing gets cheaper and cheaper, one would expect the answer for how to best sequence (and assemble!) any given genome would change. Most biologists assume something along these lines: everyone else has achieved some standard coverage (say 10x, or 100x) for their genome, so all we need to do is multiply that number times the size of my genome of interest, and then multiply that by the cost/bp, and voila! I will be able to have my very own genome sequence!
Naturally it's a bit more complicated than that, for a couple of reasons. First, the length of the reads matters quite a bit. If you're reading off a 1 GB eukaryotic genome in chunks of 100 bases, you're going to have trouble assembling the darn thing. First, you have to worry about complex repeats, which (in the context of assembly) are just plain evil, because they create connectivity structures that simply can't be resolved without additional information. Second, you need to think about sequencing bias, such as GC and AT rich regions -- most sequencers don't do that well on GC-rich regions, which are plentiful in big eukaryotic genomes. And third, normal sampling variation in shotgun coverage will screw you, on top of all of this, if you don't think about it.
So, what is the optimal sequencing strategy, then?
There's been some interesting discussion on the assemblathon mailing list about all of this, which, for the most part, I'll be paraphrasing and interpreting: the list archives are closed and the list policy about citing people is that I need to ask them for individual permission, and that's too much work :). If you're interested in the source messages, I recommend subscribing yourself and looking through the archives for messages from June 2011; if they open up the archives, I'll link directly to some of the more interesting messages.
A key component of any sequencing strategy discussion nowadays is that sequencing has become very commercial. While this drives down costs (pretty dramatically!), you also can't trust a damn thing that sequencing companies say, because the market is very competitive and there's very little percentage in straight-up honesty, much less full disclosure. (Paranoid much? Yeah, buy me beer sometime.) Moreover, there are several competing sequencing centers -- primarily the Broad Institute and the Beijing Genome Institute, as well as the Joint Genome Institute, Sanger, and St. Louis, and probably another five that I'm missing -- that all appear to have adopted different policies with respect to sequencing genomes. I don't really know what they are in detail, but (for example) Broad has a stereotyped sequencing strategy for which it has written its own software suite (see ALLPATHS-LG), and you can read the details in the PNAS paper. The bottom line is you need to talk to people who have experience with actual sequence, and not be overly trusting of either sequencing centers or company reps.
Another key component of any sequencing strategy discussion is the software being used to assemble. Some centers have their own assemblers (BGI has SOAPdenovo, Broad has ALLPATHS-LG), but there are literally dozens of assemblers out there. The assemblers can broadly be broken down into about four different types: overlap-layout-consensus, de Bruijn graph, greedy local, and "other". I'm most familiar with de Bruijn graph assemblers, since that's what I'm working with here at MSU, but there are advantages and disadvantages to the various kinds. Maybe more on that later. But the bottom line here is that there are many brilliant, passionate, opinionated people who have written their own assembler, and will swear by all that is holy that it is the best one. How do you choose?
A third key component of any sequencing strategy discussion is the genome itself. Mihai Pop's group just published a veddy interesting article on prokaryotic assembly (see Wetzel et al., 2011) in which they argue that the optimal sequencing strategy needs to be dynamically adjusted to the repeat structure of the genome: that is, you need to do a first sequencing run; analyze it for repeat structures; and then plan out your next rounds of sequencing based on that information. While I am always suspicious of plans that require intelligent thought (slow! expen$ive!) to be inserted into sequencing pipelines (fast! high throughput!), I think they make a pretty good argument -- and that's just for prokaryotic genomes, which are simple compared to eukaryotic genomes... for eukaryotic genomes, you also have to worry about heterozygosity (how much internal variation there is between the two haploid genomes you're sequencing). So how can you strategize to deal with your genome?
But let's back up. What are we doing, again?
Sequencing genomes is like this:
Long, not-terribly-random strings of (physical) DNA, O(10^7-10^10) in length.
Goal: determine full sequence and connectivity of strings of DNA.
Process: fragment into lots of bits, sequence in from both ends of each bit. Use overlaps, size of bits ("insert size"), to computationally reassemble.
(You can read an earlier blog post about why this is a hard problem here, or go read the UMD CBCB assembly primer here.)
The challenge, succinctly put, is this: in the face of uneven coverage and repetitive subsequences, devise the optimal coverage and range of insert sizes so that you can (a) sample most of the genome sufficiently and (b) resolve most repetitive regions by looking at pairs of ends. Do so (c) as cheaply as possible.
OK, so what are the parameters you can twiddle?
It really boils down to these choices:
Sequencing technology: 454 or Illumina are the main production machines these days, although I hear things about PacBio, Ion Torrent, and ABI SOLiD. 454 is much more expensive per base, but gives longer reads (500bp +); Illumina is (much) cheaper per base, but the reads are annoyingly short (100-150 bp). With Illumina you can get ~600 bp inserts easily, larger inserts (3kb, 5kb, 10kb) with more difficulty. Not sure about 454.
Coverage: how much money do you want to spend, on what sequencing technology?
Insert sizes: larger inserts are really useful for bridging repeats, but also much more expensive.
And... I think that's about it. Or is it?
Well, you need to ask two more questions: can your assembler of choice take advantage of mixed read lengths, with mixed error models from different technologies, and/or various insert sizes? And can your sequencing center actually make all the different technologies work reliably?
(As I keep telling my students, if it were easy they wouldn't need brilliant people like us to work on it, now would they?)
When I get swamped with these kinds of questions, I usually try to retreat back into my reductionist hidey hole to clear my head. So let's back up again. What are the fundamental issues?
We can't do much about sequencing bias or heterozygosity, except to say that more coverage is generally going to make both biases and internal sequence variation stand out more reliably from random error. If we actually want to assemble our genome, we also can't do much about improving current assemblers, and it's unclear how to evaluate assemblers anyway, and most of them don't appear to do a great job on very heterogenous sequence types (i.e. from multiple types of sequencers) - anyway, these are the questions the assemblathon is asking, and they're doing a good job; just read the paper when it comes out. And we don't have much control over whether or not our sequencing center screws up.
So we're left with trying to decide on how much 454, how much Illumina, and what insert sizes. (Can you hear the shrieks of pain from sequencing and assembly aficionados as I ruthlessly strip all of the subtleties from the argument? Fun!)
For insert size, I like to point people to these two references:
Whiteford et al., Nuc. Acid Res, 2005 http://nar.oxfordjournals.org/content/33/19/e171.full
Butler et al., Genome Res, 2008 http://genome.cshlp.org/content/18/5/810.full
which make the nice point that there are many repeat structures that you simply cannot resolve with single-ended reads -- you need paired-end reads to do a good job of assembly. These two papers have recently been joined by a third, the Wetzel et al. paper above, which suggests that there are particular (and surprisingly frequent) repeat structures that cannot be resolved except by a very specific insert size. But barring advance knowledge of repeat structure, I would argue that a nice range of inserts, from 3k to 5k to 10k, should give you decent results. We have that for a parasitic nematode project in which I'm involved, and it's given us decent scaffold sizes.
With 454 vs Illumina, I am skeptical that 454 is a good expenditure of money at this point. The number of bases is so astonishingly low compared to what Illumina is outputting (~1m vs ~1bn for the same amount of money, I think? At any rate, at least 100x) that you really need to justify any 454 expenditure. That having been said, I may be so used to working with crappy genome assemblies (buy me beer, hear me weep) that I'm ignoring how much better they would be with ~10x 454 coverage. Certainly Greg Dick's group at U of M has shown me pretty good evidence that 454 sequences things that Illumina won't touch, in metagenomic data. So I can't give you much more than my experience that Illumina will get you ~80% of the way to a decent genome assembly -- which is something many people would love to have.
Is there an elephant in the room, and, if so, what is it? Well, this touches heavily on our lab's research, but I think that sequencing biases are screwing up the assembly game far more than people think. Right now assemblers have a bunch of poorly understood heuristics that address sequencer-specific bias, and our experience with these in metagenomic sequencing suggests that these artifacts and heuristics are a significant source of misassembly. More on that ... later.
I'm really at a loss about how to conclude any discussion of sequencing strategy. It's ridiculously complicated, comes down to a lot of guessing about what problems you're likely to run into, and involves an extremely rapidly changing technology suite. Getting a comprehensive answer out of anyone is hard... and won't get any easier for a while.
That having been said, I'd appreciate pointers to blog posts and open discussions of these issues on mailing lists. Having (tried to) teach some biologists in this area recently, as part of my NGS course, I think actually providing these discussions could be incredibly valuable and could raise the level of discourse a fair bit.
--titus
posted at: 17:10 | path: /aug-11 | 4 comments
Sun, 07 Aug 2011
Why the Cloud does not solve the computational scaling problem in biology
There's been a lot of hooplah in the last year or so about the fact that our ability to generate sequence has scaled faster than Moore's Law over the last few years, and the attendant challenges of scaling analysis capacity; see Figure 1a and 1b, this reddit discussion, and also my the sky is falling! blog post.
There's also been been some backlash -- it's gotten to the point where showing any of the various graphs is greeted with derision, at least judging by the talk-associated Twitter feed.
Figure 1a. DNA sequencing costs, from http://www.genome.gov/sequencingcosts/
Figure 1b. Sequencing costs vs hard disk costs. Slide courtesy of Lincoln Stein.
From by the discussions I've seen, I think people still don't get -- or at least don't talk about -- the implications of this scaling behavior. In particular, I am surprised to hear the cloud (the cloud! the cloud!) touted as The Solution, since it's clearly not a solution to the actual scaling problem. (Also see: Wikipedia on cloud computing.)
To see why, consider the following model. Take two log-linear plots, one for cost per unit of compute power (CPU cycles, disk space, RAM, what have you), and one for cost per unit of sequence ($$ per bp of DNA). Now suppose that sequence cost is decreasing faster than compute cost, so you have two nice, diverging linear lines when you plot these trends over time on a log-linear plot (see Figure 2).
Figure 2. A simple model of the (exponential) decrease in compute costs vs (exponential) decrease sequencing data costs, against time.
Suppose we're interested in how much money to allocate to sequencing, vs how much money to allocate to compute -- the heart of the problem. How do these trends behave? One way to examine them is to look at the ratio of the data points.
Figure 3. The ratio of compute power to data over time, under the model in the previous figure.
As you'd expect, the ratio of compute power to data is also log-linear (Figure 3) -- it's just the difference between the two lines in Figure 2. Straight lines on log-linear plots, however, are in reality exponential -- see Figure 4! This is a linear-scale plot of compute costs relative to data costs -- and as you can see, compute costs end up dominating.
Figure 4. Ratio of compute power to data, over time, on a linear plot.
With this model, then, for the same dollar value of data, your relative compute costs will increase by a factor of 1000 over 10 years. This is true whether or not you're using the cloud! While your absolute costs may go up or down depending on infrastructure investments, Amazon's pricepoint, etc., the fundamental scaling behavior doesn't change. It doesn't much matter if Amazon is 2x cheaper than your HPC -- check out Figures 5a, b, and c if you need graphical confirmation of the math.
Figure 5a,b,c: Scaling behavior isn't affected by linearly lower costs.
The bottom line is this: when your data cost is decreasing faster than your hardware cost, the long-term solution cannot be to buy, rent, borrow, beg, or steal more hardware. The solution must lie in software and algorithms.
People who claim that cloud computing is going to provide an answer to the scaling issue with sequence, then, must be operating with some additional assumptions. Maybe they think the curves are shifted relative to one another, so that even 1000x costs are not a big deal - although figure 1 sort of argues against that. Like me, maybe they've heard that hard disks are about to start scaling way, way better -- if so, awesome! That might change the curves for data storage, if not analysis. Perhaps their research depends on using only a bounded amount of sequence -- e.g. single-genome sequencing, for which you can stop generating data at a certain point. Or perhaps they're proposing to use algorithms that scale sub-linearly with the amount of data they're applied to (although I don't know of any). Or perhaps they're planning for the shift in Moore's Law behavior that will come when that Amazon and other cloud computing providers build self-replicating compute clusters on the moon (hello, exo-atmospheric computing!) Whatever the plan, it would be interesting to hear their assumptions explained.
I think one likely answer to the Big Data conundrum in biology is that we'll come up with cleverer and cleverer approaches for quickly throwing away data that is unlikely to be of any use. Assuming these algorithms are linear in their application to data, but have smaller constants in front of their big-O, this will at least help stem the tide. (It will also, unfortunately, generate more and nastier biases in the results...) But I don't have any answers for what will happen in the medium term if sequencing continues to scale as it does.
It's also worth noting that de novo assembly (my current focus...) remains one of the biggest challenges. It requires gobs of the most expensive computational resource (RAM, which is not scaling as fast as disk and CPU), and there are no good solutions on the horizon for making it scale faster. Neither mRNAseq nor metagenomics are well-bounded problems (you always want more sequence!), and assembly will remain a critical approach for many people for many years. Moreover, cloud assembly approaches like Contrail are (sooner or later) doomed by the same logic as above. But it's a problem we need to solve! As I said at PyCon, "Life's too short to tackle the easy problems -- come to academia!".
--titus
p.s. If you want to play with the curves yourself, here's a Google Spreadsheet, and you can grab a straight CSV file here.
posted at: 12:47 | path: /aug-11 | 8 comments