Tue, 20 Dec 2011

Paper draft: Scaling metagenome sequence assembly with probabilistic de Bruijn graphs


(updated to point to http://arxiv.org/).

Authors: Jason Pell, Arend Hintze, Rosangela Canino-Koning, Adina Howe, James M. Tiedje, C. Titus Brown

Abstract:

The memory requirements for de novo assembly of short-read shotgun sequencing data from complex microbial populations are an increasingly large practical barrier to environmental studies. Here we introduce a memory-efficient graph representation with which we can analyze the k-mer connectivity of metagenomic samples, allowing us to reduce the size of the de novo assembly process for metagenomes with a "divide and conquer" algorithm. This graph representation is based on a probabilistic data structure, a Bloom filter, that allows us to store assembly graphs in as little as 4 bits per k-mer. We use this approach to achieve a 20-fold decrease in memory for the assembly of a soil metagenome sample.

The paper is available on arXiv here: http://arxiv.org/abs/1112.4193. Comments are welcome! We're planning to submit it to PNAS later this week.

I'll write a longer blog post about it soon.

--titus

posted at: 09:10 | path: /dec-11 | 0 comments

Tags: , , ,


Sun, 11 Dec 2011

Data Intensive Science, and Workflows


I'm writing this on my way back from Stockholm, where I attended a workshop on the 4th Paradigm. This is the idea (so named by Jim Gray, I gather?) that data-intensive science is a distinct paradigm from the first three paradigms of scientific investigation -- theory, experiment, and simulation. I was invited to attend as a token biologist -- someone in biology who works with large scale data, and thinks about large scale data, but isn't necessarily only devoted to dealing with large scale data.

The workshop was pretty interesting. It was run by Microsoft, who invited me & paid my way out there. The idea was to identify areas of opportunity in which Microsoft could make investments to help out scientists and develop new perspectives on the future of eScience. To do that, we played a game that I'll call the "anchor game", where we divvied up into groups to discuss the blocks to our work that stemmed from algorithms and tools, data, process and workflows, social/organizational aspects. In each group we put together sticky notes with our "complaints" and then ranked them by how big of an anchor they were on us -- "deep" sticky notes held us back more than shallow sticky notes. We then reorganized by discipline, and put together an end-to-end workflow that we hoped in 5 years would be possible, and then finally we looked for short- and medium-term projects that would help get us there.

The big surprise for me in all of this was that it turns out I'm most interested in workflows and process! All of my sticky notes had the same theme: it wasn't tools, or data management, or social aspects that were causing me problems, but rather the development of appropriate workflows for scientific investigation. Very weird, and not what I would have predicted from the outset.

Two questions came up for me during the workshop:

Why don't people use workflows in bioinformatics?

The first question that comes to mind is, why doesn't anyone I know use a formal workflow engine in bioinformatics? I know they exist (Taverna, for one, has a bunch of bioinformatics workflows); I'm reasonably sure they would be useful; but there seems to be some block against using them!

What would the ideal workflow situation be?

During the anchor game, our group (which consisted of two biologists, myself and Hugh Shanahan; a physicist, Geoffrey Fox; and a few computer scientists, including Alex Wade) came up with an idea for a specific tool. The tool would be a bridge between Datascope for biologists and a workflow engine. The essential idea is to combine data exploration with audit trail recording, which could then be hardened into a workflow template and re-used.

---

Thinking about the process I usually use when working on a new problem, it tends to consist of all these activites mixed together:

  1. evaluation of data quality, and application of "computational" controls
  2. exploration of various data manipulation steps, looking for statistical signal
  3. solidifying of various subcomponents of the data manipulation steps into scripted actions
  4. deployment of the entire thing on multiple data sets

Now, I'm never quite done -- new data types and questions always come up, and there are always components to tweak. But portions of the workflow become pretty solid by the time I'm halfway done with the initial project, and the evaluation of data quality accretes more steps but rarely loses one. So it could be quite useful to be able to take a step back and say, "yeah, these steps? wrap 'em up and put 'em into a workflow, I'm done." Except that I also want to be able to edit and change them in the future. And I'd like to be able to post the results along with the precise steps taken to generate them. And (as long as I'm greedy) I would like to work at the command line, while I know that others would like to be able to work in a notebook or graphical format. And I'd like to be able to "scale out" the computation by bringing the compute to the data.

For all of this I need three things: I need workflow agility, I need workflow versioning, and I need workflow tracking. And this all needs to sit on top of a workflow component model that lets me run the components of the workflow wherever the data is.

I'm guessing no tool out there does this, although I know other people are thinking this way, so maybe I'm wrong. The Microsoft folk didn't know of any, though, and they seemed pretty well informed in this area :).

The devil's choice I personally make in all of this is to go for workflow agility, and ignore versioning and tracking and the component model, by scripting the hell out of things. But this is getting old, and as I get older and have to teach my wicked ways to grad students and postdocs, the lack of versioning and tracking and easy scaling out gets more and more obvious. And now that I'm trying to actually teach computational science to biologists, it's simply appallingly difficult to convey this stuff in a sensible way. So I'm looking for something better.

One particularly intriguing type of software I've been looking at recently is the "interactive Web notebook" -- ipython notebook and sagenb. These are essentially Mathematica or matlab-style notebook packages that work with IPython or Sage, scientific computing and mathematical computing systems in Python. They let you run Python code interactively, and colocate it with its output (including graphics); the notebooks can then be saved and loaded and re-run. I'm thinking about working one or the other into my class, since it would let me move away from command-line dependence a bit. (Command lines are great, but throwing them at biologists, along with Python programming, data analysis, and program execution all together, seems a bit cruel. And not that productive.)

It would also be great to have cloud-enabled workflow components. As I embark on more and more sequence analysis, there are only about a dozen "things" we actually do, but mixed and matched. These things could be hardened, parameterized into components, and placed behind an RPC interface that would give us a standard way to execute them. Combined with a suitable data abstraction layer, I could run the components from location A on data in location B in a semi-transparent way, and also record and track their use in a variety of ways. Given a suitably rich set of components and use cases, I could imagine that these components and their interactions could be executed from multiple workflow engines, and with usefully interesting GUIs. I know Domain Specific Languages are already passe, but a DSL might be a good way to find the right division between subcomponents.

I'd be interested in hearing about such things that may already exist. I'm aware of Galaxy, but I think the components in Galaxy are not quite written at the right level of abstraction for me; Galaxy is also more focused on the GUI than I want. I don't know anything about Taverna, so I'm going to look into that a bit more. And, inevitably, we'll be writing some of our own software in this area, too.

Overall, I'm really interested in workflow approaches that let me transition seemlessly between discovery science and "firing for effect" for hypothesis-driven science.

A few more specific thoughts:

In the area of metagenomics (one of my research focuses at the moment), it would be great to see img/m, camera, and MG-RAST move towards a "broken out" workflow that lets semi-sophisticated computational users (hi mom!) run their stuff on the Amazon Cloud and on private HPCs or clouds. While I appreciate hosted services, there are many drawbacks to them, and I'd love to get my hands on the guts of those services. (I'm sure the MG-RAST folk would like me to note that they are moving towards making their pipeline more usable outside of Argonne: so noted.)

In the comments to my post on Four reasons I won't use your data analysis pipeline, Andrew Davison reminds me of VisTrails, which some people at the MS workshop recommended.

I met David De Roure from myExperiment at the MS workshop. To quote, "myExperiment makes it easy to find, use, and share scientific workflows and other Research Objects, and to build communities."

David put me in touch with Carole Goble who is involved with Taverna. Something to look at.

In the cloud computing workshop I organized at the Planet and Animal Genome conference this January, I will get a chance to buttonhole one of the Galaxy Cloud developers. I hope to make the most of this opportunity ;).

It'd be interesting to do some social science research on what difficulties users encounter when they attempt to use workflow engines. A lot of money goes into developing them, apparently, but at least in bioinformatics they are not widely used. Why? This is sort of in line with Greg Wilson's Software Carpentry and the wonderfully named blog It will never work in theory: rather than guessing randomly at what technical directions need to be pursued, why not study it empirically? It is increasingly obvious to me that improving computational science productivity has more to do with lowering learning barriers and changing other societal or cultural issues than with a simple lack of technology, and figuring out how (and if) appropriate technology could be integrated with the right incentives and teaching strategy is pretty important.

--titus

p.s. Special thanks to Kenji Takeda and Tony Hey for inviting me to the workshop, and especially for paying my way. 'twas really interesting!

posted at: 07:00 | path: /dec-11 | 8 comments

Tags: , , , ,


Tue, 06 Dec 2011

Is Discovery Science Really Bogus?


This blog post was inspired by two recent events.

First, in response to a NY Times article about the "data deluge" affecting biologists, one of my Facebook friends said something like "stop whining about how hard it is to analyze the data and do some good experiments instead!" I vehemently disagreed with this simple statement -- but why??

Second, I used the 4th domain paper by Jonathan Eisen in my computational science class, and we discussed how one would reject or accept the 4th domain with more confidence. Somewhat to my surprise, my own conclusion was that I would ... sequence everything! Yep, just go out and sequence everything I could get my hands on in the tree of life, as well as a bunch of communities from ocean and soil. I was surprised to reach this conclusion (which we can debate on its own merits some other time) because my background is in real science, not "discovery science", and I'd been trained to believe that the discovery-based approach of shaking the trees to see what falls out was kind of unintellectual and unscientific.

Both of these events made me rethink my attitude towards discovery science. The first, because the guy that told us all to stop whining isn't dumb, but I also don't think he's entirely or even mostly right; and the second, because, together with the first, it made me challenge the conventional wisdom in molecular biology that hypothesis driven science is the Right Way.

Hypothesis-driven biology

The way many (most? all?) molecular biologists work is something like this: they develop a theory about some process (physiological or developmental or genomic or whatnot), develop a specific hypothesis or set of hypotheses, and then figure out how to test those hypotheses using controlled experiments. "Hypothesis: objects of near equal mass accelerate equally in a uniform gravitational field; test: drop two objects of equal mass; control for wind resistance." In developmental biology, the molecular field with which I'm most familiar, you might say "I think that the pax3/7 gene is necessary for neural crest specification in these cells at this time, so I'm going to knock it down and see what happens to neural crest." The key point is that you always need to reframe your theory in the form of a fairly specific hypothesis, and then figure out a way to test it. Training students to develop, frame, and test hypotheses is What We Do as professors. When you write grant proposals, you write about why you have developed a specific set of hypotheses (that is, you justify your hypotheses by appealing to prior work and preliminary results), claim that these hypotheses are important or interesting, and then argue vigorously that you are the right person to receive beaucoup bucks to test these hypotheses. Hypothesis-driven research is what we do!

This somewhat dogmatic picture obscures a number of inconvenient truths, however. First of all, many grant agencies (and reviewers) are risk averse, so they prefer to fund things that appear as certain as possible. This means you have to walk the line between crippling your hypotheses by predetermining them with your data, and coming up with an interesting and novel hypothesis -- if you've already tested your hypothesis and you're pretty sure it's right, then it's no longer that interesting to test! Second, no research plan survives contact with reality. So what you really do is sketch out a small extension to a near-certain hypothesis, get funded (admittedly this step is rather rare...), and then discover that your extension is incredibly simplistic and most likely wrong and a dead-end alley. So you end up working on something else completely. That is, you get the grant to work on X and end up working on Z -- not necessarily too far away from X, but not X, either. This leads to a third truth, which is that you get grants because you've been able to make a successful argument, not because anyone expects you to accomplish exactly what is in the grant. The only people that really take your grant proposal literally are the contracts & grants people at your university; the grant administrator and you both understand that this is research, and a real researcher is likely to end up someplace other than where you intended to go, at least in detail.

(I always like to cite Einstein at this point in a conversation: "If we knew what we were doing, it wouldn't be called research, would it?")

What happens when a student confronts this situation? Well, usually students have to write research proposals as part of their qualifying exams, and often students try to stick to those research proposals even when their experiments go awry. I've been part of a bunch of committees where the student will say "ok, so these were our original aims, and here's where I've gone away from them". They don't seem to understand that we don't care (or at least I don't): the real point of the qual is to make sure they know how to frame a hypothesis, and to ensure that they know what a testable hypothesis looks like, smells like, feels like, and tastes like. After that, your research will go where it goes, and that's as it should be. (Aside: my most frustrating (but still positive) moment as a committee member occurred when a student presented a hypothesis and talked about how she was going to test that hypothesis with method X, method Y, data analysis Z, etc. We asked her a bunch of questions and she seemed strangely confident and specific about the expected results. Upon further probing, it turned out that she'd already done the experiments and knew the answers, but thought that the qual needed to be about her hypotheses de novo, and shouldn't take into account actual data she'd generated. WTF??)

Unfortunately, we often stick students with projects where there is no honest way to frame a specific hypothesis. This is true of young labs, which may not have enough specific data to develop a good testable hypothesis for their system and are still casting about for a specific direction to take; and it's increasingly true of established labs that are using next-gen sequencing.

Cue next story: a student of mine was (and still is) part of a collaboration where we were doing bioinformatic analysis of genome-scale disease data. The other professor had funding and generated the data, which was basically sequencing RNA from an affected organ. There literally was no specific hypothesis other than "let's go see how this disease is affecting the spleen transcriptional response." This was then given to my student, who happily pounded away at the data for a few months (making many more trenchant observations about mRNAseq than about the disease, but nonetheless making progress). It came time for his committee meeting, and his committee insisted that he present a hypothesis. He cast about for a while, and finally come up with "there will be a differential transcriptional response to this disease in the spleen." This was nearly disastrous, of course, because it's simply not very specific! Sure, it's a hypothesis, and it's almost certainly true, but it's not specific enough to be useful. So my student nearly failed his committee meeting (note that I was a young(er) prof at this point, and hadn't seen this coming; my fault!) Why am I telling this story, though? Because my collaborator, who had generated the data in a hypothesis free manner, was a member of the committee, and was very disturbed by my student's lack of a hypothesis. Why? Because it was considered very important that our students be doing hypothesis driven science, even though neither he nor I had directed the project that way!

Before I continue on to draw a lesson from this, let me say: I'm not anti-hypothesis in any way. The student is, eventually, going to have to develop a hyp, or he isn't going to get his PhD; he knows that, and I know that. But we were still working on generating hypotheses from the data, and didn't have them ready at hand; developing the hypotheses was actually the first, very significant component of the project. Another point is that the committee was completely unprepared for this. And a third was that the guy who generated the data was so wedded to this hypothesis-driven approach that he basically ended up being hypocritical -- which I point out to him regularly :). (Another component that actually played a smaller part than I'd feared was the computational nature of the research: a certain subset of molecular biologists will vehemently deny that useful work can be done without a pipette man in hand. This either leads to ineffective one-handed typing in bioinformatics, or vociferous arguments among professors -- neither good for committee meetings.)

The lesson I want to draw from this anecdote is simple: hypothesis-driven science is dead!

No, no, not at all. More seriously, I think that as data generation becomes easier in some fields of biology, we should recognize that an extended period of hypothesis generation through discovery-driven approaches may be useful and necessary for many projects. Many biologists may not be any good at this, because they've been honed for decades to focus on moving as quickly as possible to a hypothesis based on a relatively small amount of hand-curated data; but in practice, hypotheses are now cheap (because data is plentiful) and I think we should focus on developing likely hypotheses and winnowing out the dumb 'uns computationally before we ever pick up a pipette man to test 'em. That is, expand the hypothesis-generation and analysis stages so that we're more likely to develop a comprehensive and interesting hypothesis.

About Models, and Model Systems

One of the limitations of the drive to proximal hypotheses is that you need to have tractable systems -- systems in which you can relatively quickly and easily test hypotheses. This leads to using models, and model systems. For example, Drosophila is a great model for genetics and development: it's been used for decades, and has led to at least one set of Nobel prizes for basic understanding of genetics. You can do lots and lots of things with it way more easily than you could imagine doing those same things in a mammalian system: mutagenize, resequence the genome from scratch, do all sorts of crosses in what appear to be a few weeks, etc. etc. But, whether you're interested in biomedical applications, or you're interested in population genetics, or whatnot, it's still just a model, and to build a connection to the broader set of science, you need to analogize the model in various ways. The bigger the field around the model system gets, the less the people feel the need to make the model explicit, and then the junior people forget about it. And so sometimes the model just doesn't apply. One of my favorite examples (just to pick on Drosophila and C. elegans, which are the two biggest invertebrate animal model systems) is from the early days of genomics. We sequenced mouse, and human, and Drosophila, and C. elegans, and saw that there were about 30% more types of genes and gene families in vertebrates. This led to a certain amount of breathless discussion about "the genes that made us vertebrates". Then we sequenced hydra (most emphatically not a vertebrate!), and discovered that it had almost all those gene families. Bang! It turned out that Drosophila and C. elegans were members of a monophyletic group, the Ecdysozoa, which had undergone extensive gene loss! So in some ways, Drosophila and C. elegans are really bad models for vertebrate genomics! They're from a relatively distant branch of the animals, they have small genomes partly because they were chosen for rapid breeding, and there are lots of things that are just different about them. They're still awesome, and they deserve a lot of study, but the history of genetic research on them really shows both the pluses and minuses of model systems: sometimes a model system that's great for one reason is horrible for another.

The same thing happens in ecology and population genetics, it seems to me. There's a lot of mathematical models that are simple and tractable and that let you "test hypotheses" about certain kinds of relationships, but then you have to determine how relevant those models are to reality. People would prefer not to spend that kind of time or effort -- because it's time and effort not spent generating and testing hypotheses. So the connection is made only for a few kinds of systems, which limits the vision of people doing research.

What about cancer?

I think another catalyst that made me think about all of this is the book The Emperor of Maladies, a Pulitzer-prize winning biography of cancer. There you see again and again how hypothesis driven approaches basically failed, while we slowly developed diagnostic tools and (frankly) guessed randomly about how to deal with cancer. Only recently have we started to gain an understanding of exactly what's going on at the genomic and genetic level, but it's still slow to make its way into therapeutic use; chemo -- killing the cancer slightly more quickly than the normal cells -- is still the main treatment, for chrissakes. Do you think we would do that if we had any other option?? Reading the book, the guy who developed the Pap smear (an excellent diagnostic for cervical cancer) did so on guinea pigs, because it was the only way he could detect estrus in guinea pigs -- by scraping the cervix. He spent 20 years trying to find a biomedical use for it! That's not hypothesis-driven science. Epidemiology has probably had a greater effect on cancer treatment than anything else, by tracking down the specific causes of various conditions like lung cancer, long before we were thinking about cellular mutations.

In my class the other day, the one where we talked about the 4th domain work, James Foster from U Idaho made the point that observation in biology used to be called "Natural History". One of the greatest successes of Natural History? Evolution, the greatest explanatory theory in biology, came directly from the synthesis of vast amounts of observation, with no experiment involved. It took decades for Darwin and others to put it together, and decades more for it to be validated in a hypothesis-driven framework (I'm thinking the finches, or the Lenski E. coli experiment, here; there are probably better places to cite that I don't know about).

The Molgula

When my Facebook friend & colleague talked about how we should stop bitching about data processing and start thinking about experiments, I'm pretty sure he meant that people should be better hypothesis-driven scientists. My instinctive reaction to that thesis is that he's not right (nor is he entirely wrong -- hypothesis-driven science is still necessary, just not sufficient!)

One of my current projects is working on a group of sea squirts, the Molgula, that underwent a dramatic morphological change in the larval form: many of the larvae lack tails. We want to know, how did this happen?

To address this question, we went out and generated about 600 million reads of mRNAseq from a variety of larval stages for a tailed sea squirt, a tailless sea squirt, and hybrid crosses between them. This has let us ask which genes are present, what their levels of expression are, and whether there is allele-specific expression of certain genes in one species over another (never mind, just trust me, it's important & interesting to know). In order to analyze this data - which amounts to about 80 GB of DNA, compressed -- we've had to invent a whole new series of data analysis and reduction tools. This is because the Molgula aren't well-studied model systems: they don't have genome sequences available, no large scale cDNA projects have been done on them, and the molecular tools for doing basic probes are still thin on the ground. It was far easier to spend $20k on sequencing and get an answer in a matter of months -- even counting the development of the data analysis tools -- than it was to do anything else.

Are we going to now go out and take our high-throughput data and analyze it and conclude, voila, we know why the tails aren't forming? No, we're not that dumb! But we are developing several early hypotheses based on the data we have, and we're checking to see if they're plausible in the face of tissue-specific gene expression assays (WMISH). Then we'll go and do the hypothesis-driven perturbations to see: is the tail being specified and failing to extend? Or is it not being specified at all?

It's worth pointing out that virtually everything known about tail development in the sea squirts comes from one particular species, Ciona intestinalis, which is now a pretty established model system: genome, database, EST projects, a whole community. The Molgula, however, which look morphologically pretty similar, are about as far away from Ciona (evolutionarily speaking) as you can get and still be a sea squirt. Wouldn't it be fascinating to know how tails develop in them? Well, if we hadn't lucked into some excellent seed funding for the Molgula project and been able to generate and analyze the vast amounts of sequence, we wouldn't be on our way to looking at them -- this kind of study is seen as a fishing expedition, not worthy of being funded.

This is really the problem with hypothesis-driven approaches, and the priority we give them: they focus us on the questions that can be answered fairly quickly and easily, and not necessarily on the big questions. Sometimes it's possible to find a fundable route to those big questions; sometimes not. In the latter case, the questions go unaddressed.

Soil metagenomes

The other big-ass data project I'll bring up is the Great Prairie Grand Challenge, in which the DOE JGI is sequencing literally terabases worth of DNA extracted from midwestern soil. The ultimate goal is to understand the microbial community composition and function.

Do we have any idea how to do that?

Well, the answer is, "not really". The field of metagenomics is still young, and it turns out to be technologically blocked. That is, the diversity of soil is so high that you need to sample it really deeply; but then the depth of sampling yields so much data, that you can't do anything clever with it computationally. This is one of the other focuses of my lab, and it's emphatically a long-term discovery-driven project. We have only a little idea of what we're looking for, and it's likely to be unrecognizable on the first four looks. We'll have to look and think deeply, AFTER solving the data analysis problems (which, again, I think we have. But it was really hard :).

Rumsfeldian science

One of my other favorite citations is that great Rumsfeld quote, about the known knowns, known unknowns, and unknown unknowns (in his case, with respect to invading Iraq -- oops). We know so little about biology that to restrict our gaze to the known knowns, or even to the known unknowns, is foolish.

Look again at this evolutionary tree of life, from Norm Pace's lab. We understand virtually nothing about the vast majority of those organisms. Sure, we can start to get at the commonalities of some aspects of protein composition, cellular organization, and genomics. But who knows what's out there? Certainly not me, and I suspect no one else. We have a long way to go.

To return to the original purpose of this rant, a lot of this "known unknown" and even more of this "unknown unknown" stuff involves looking at vast amounts of data and finding clever ways to grok the structure of the data, filter out stuff we think is uninteresting, and cherry pick the stuff that IS interesting. This is one of my focuses, and it is hard, specialized, time consuming, and wonderfully challenging. To hear other scientists say, dismissively, that we need to learn how to do proper experiments is a bit disheartening, and, even more problematically, rather short-sighted for the field.

Data -- especially the vast amounts of next-gen data starting to come from sequencers -- is usefully "hypothesis neutral". In Timo Hanny's defense of Chris Anderson's theory that "hypotheses are dead" in The 4th Paradigm, he pointed out that surely there is some point where "more" is different from "some". Being able to sensitively look at minor members of communities, or low-expressed genes and isoforms, will inevitably be informative; we shouldn't just discard it as "that useless discovery science stuff".

In conclusion:

A key part of doing good hypothesis driven science is to come up with good hypotheses based on large-scale observations of biological systems. We should respect that initial stage of observing more than we seem to. My graduate advisor, Eric Davidson, told me the famous analogy about scientific practice being similar to a drunk, having dropped his keys in a dark alleyway, looking for them under the street light; while some people spend their career carrying flashlights into dark corners and doing a really detailed search, and others work on the flashlights, I think it's also going to fruitful to turn up the wattage on the street light so that all of those dark corners get illuminated. And we'll need sharp eyes to search all that newly lit territory. DNA sequencing is turning up the wattage; let's develop the methods to find the nifty stuff that we can now see.

--titus

posted at: 17:25 | path: /dec-11 | 5 comments

Tags: , ,


Fri, 14 Oct 2011

Metagenomics: we have all the answers (they're just all different)


I'm just on my way back from a JGI workshop on metagenome informatics, and I thought I'd take the opportunity to write up a short review.

The workshop was, frankly, excellent. We saw a bunch of talks on metagenome assembly (my current interest) as well as single-cell sequencing approaches, and a whole spate of data analysis platform talks. The basic perspective seems to be this: nobody has the solution to everything, and those who do have some answers seem to get different answers from others. The one possible exception to this was Jill Banfield, who gave a really inspiring talk about how they're using metagenomics on low- and medium-complexity communities to do some excellent science.

(Advertisement: my talk on scaling metagenome assembly is online here, at slideshare.)

The only somewhat dissatisfying note to the whole conference was the overall lack of inter-project cooperation and cross-validation. While some of the analysis platform folk are talking to each other, I don't think they have a good handle on how to really test each other's software, much less combine forces; too much of it all boils down to "trust us -- our approaches work", which is simply not the way to do science. Period.

Pain

"The thing that is driving all of us is pain." -- Victor Markowitz

This single statement probably sums up the workshop best! One of the most obvious tensions in the workshop was between sensitivity of results and scaling. For many microbial populations, it is critical to sequence deeply in order to see rare variants and species; and yet our assembly and annotation tools are, generally speaking, unprepared to handle this volume of data (dozens of Gb, if not Tb). This leads to pain, as we desperately attempt to make use of the data to address our biology.

There were also quite clearly two camps of people. One group had experience with metagenomic data sets of simple- to middling-complexity: think human microbiome, with up to a few hundred species of microbes per sample. The other group was confronting water and most especially soil, which may have hundreds of thousands if not millions of species per sample. The first group was prone to saying things like "it's not that tough a problem, folks! you just need to analyze more sequence! and then you can do anything you want!" to which the second group would then say "yeah, that's the problem, innit? all that sequence?"

I think in a year or so it will be easier to characterize this gap in complexity. We're finally getting results from soil assembly, and it's clear that we need terabases of sequence to get good results; but we don't yet have the ability to quickly & confidently analyze those terabases of sequence. Once we do, we can make quantitative statements illustrating the divide.

It was also nice to hear that everyone had settled on the best possible data processing pipeline at the meeting! (Although perhaps coincidentally, it was almost always their own software.)

Standards

The second day featured a lot of discussion of standards. Many people seemed to have a community standard that, if everyone would just start using theirs, would solve all the problems!!!

http://imgs.xkcd.com/comics/standards.png

Of particular interest to me, Owen White gave a talk where he presented on the Open Source Data Framework. This is basically a central-server NoSQL implementation for storing metadata and data together. Eventually it will support data migration for locality, and lots of other nice features. The response to building Yet Another Big Database (But This Time, With No Schema!!) was muted, although I'm personally quite bullish on the idea that maybe, just maybe, we can have a place where the data and metadata are combined (!!!). It seems that the alternative to an OSDF-like database is to continue using flat files -- maybe even with URLs sometimes!! -- and I see that approach continuing to engender chaos. An alternative would be nice.

(I hate SQL for dataset storage, but maybe that's not a majority opinion.)

One particularly good quote from Victor Markowitz on the OSDF proposal, paraphrased: "This system will let us spread misunderstandings further and more quickly!" Hmm, is that a negative?

Also, someone said the HMP has 14 trillion reads, or 1.4 petabases of sequence. Whuh? Apparently this is what IMG/HMP and METAREP are being built to analyze. Yikes.

Big Black Data Analysis Boxes

Big black boxes for data analysis is a running theme in metagenomics, although I'm not entirely sure why -- something in the water? (We should sequence that.) Things like MG-RAST, IMG/M, CAMRA, and the various K-base projects offer to take your data (raw or not); fold, spindle, and mutilate it; and then return Results.

This approach gives me the heebiejeebies. I have lots of questions: What software are they running, how, with what parameters? What kind of internal QC do they have, and how can I run it myself on my own data? What version of what databases are used? What filtering do they do? The response (more or less) always seems to be "TRUST us! We know what we're doing!"

That line works better in a sitcom than it does in real life.

I really, really want to see a platform mentality evolve. Cue Steve Yegge's "platform" rant: http://news.ycombinator.com/item?id=3101876 or http://steverant.pen.io/

What's the goal, anyway? Reads vs. assembly

Several people were particularly grumpy about this whole "assembly" thing. I can only imagine this stems from having to go to too many meetings where a lot of really theoretical assembler discussion goes on, and is nonetheless treated like the central activity necessary for metagenomics. Since I work on assembly now, I enjoy those talks, but I fully understand that I am a masochist when it comes to science, and not everyone likes assembly.

Nonetheless I think people skeptical of assembly are largely wrong, at least in the overall concept. Assembly is a really important approach in metagenomics for several reasons, which Kostas Mavrommmmmmmmatis laid out the second day: scientifically speaking, assembly gives way better statistical signals for both gene finding and analysis of variation; you can also get linkage information from assembly, which is important for operon analysis. And, just as important -- from a pragmatic perspective, the reduction in data size from assembly saves both space and computation time, and assembly smooths out errors in way that not much else does.

That's not to say assembly is easy, or a panacea, but I think it's an incredibly valuable approach for all those reasons.

As for single read analysis, I am deeply skeptical of any conclusions reached from analyzing reads of length 120 or less -- that's not much signal, bub. (Note, I'm also guilty of this; check out my publication with Victoria Orphan in 2006. What can I say? I was younger and naiver.) Friends don't let friends analyze Illumina short reads. And yet, it seems to be a prevalent, perhaps even dominant, activity in metagenomics. Boo.

So: assemble, man. It's needed.

But as for goals... well, Brooklin Gore said it well when he said, "You come to understand that the purpose of a machine is ultimately not to run programs, but to solve problems." What problems do we want to solve?

Good question.

There wasn't much unity in the goals of the attendees. Some people wanted whole genomes. Others wanted genes. Yet others wanted variations in those genes. Presumably most people wanted metabolic profiles of one sort or another, except for those who didn't. I think this reflects something that Guy Cochrane pointed out: sequencing is fast becoming a common feature to much of biology, and it is extremely hard to address everyone's needs in one platform. So, that makes it even more unlikely that one single approach or one single analysis platform will answer even a majority of users's needs. There's a lot of room out there, folks.

On benchmarking

Everyone got up and gave a different set of assembly results, and it became kind of obvious that everyone optimizes their assembler for a specific set of statistics and then reports only those. Hey -- what about a standard set of benchmarks??

Well, there was a strange reluctance on the part of some Senior Scientists to invest in assembly benchmarking. As near as I could make out, this was because of a fear that assembler authors would then dive into optimizing for those benchmarks at the expense of Platonic truth. Fair 'nuff. But I think there's gotta be a happy medium between No Benchmarks and Only Benchmarks. Right now I find it extremely difficult to figure out which assemblers are better for what purposes, and I'm pretty sure everyone else is just as confused (modulo the authors of assemblers themselves, who are generally quite positive that their assembler is the best).

Anyhoo, I'm thinking quite hard about setting up a couple of single cell and metagenomic data sets, along with assembly and evaluation pipelines, on Amazon Web Services. That should at least make it easy to run people's assemblers on the same data sets & compare the results.

--

Hmm, conclusion time...

The metagenomic and single cell data is already coming, thanks to our sequencing center overlords.

Generally speaking, we don't know how to handle it, either by assembly or by single read analysis. We certainly don't know how to scale all the analyses that everybody wants to do.

The data will still be coming, regardless.

Good times!

--titus

posted at: 20:25 | path: /oct-11 | 0 comments

Tags: , ,


Sun, 29 Aug 2010

Assembly is hard because it's not decomposable


(with Adina Howe, Jason Pell, Rosangela Canino-Koning, and Arend Hintze).

Introduction

A few weeks ago I blogged a bit about a k-mer filtering system, khmer, that we were using to reduce metagenomic data to a more tractable size by throwing out error-prone reads (see A memory efficient way to remote low-abundance k-mers from large DNA data sets). No sooner had we tried that, than did we realize that we were probably primarily throwing away good, if low-abundance data (see Illumina reads and their features). No matter: we couldn't assemble the original data sets anyway, so we had to get rid of some of it, right?

The subject of this blog post is not on how to best throw away data. (I'll address that in a few weeks.) Instead, it's on why we have to throw away data in the first place. More precisely,

Why is assembly hard?

First, some background. Imagine you have some long-ish strings (1mn - 200 mn in length), composed of only the letters A, C, G, and T, and you want to know what the sequence of the strings is. You can't actually read the sequences directly; they're too physically small. But you can randomly retrieve short subsequences ~100-1000 letters in length from the original long sequences. You don't know where they're from on the original sequence, or even which of the original sequences they're from. And the process of retrieval is error-prone, so you can't even trust the exact sequence you get. But you do know that, by and large, the short sequences are mostly correct; and (the most important bit) that you can get as many of these short sequences as you want, within $$ limitations.

From this kind of information you want to reconstruct the original sequences.

This is a basic description of the process of shotgun sequencing, in which you take DNA, shred it, and then sequence from it randomly -- many, many times. And it lays out the basic problem of assembly, too: you want to figure out how to reconstruct the original sequences from the little subsequences that you actually have.

If you are a computer scientist, you can probably already think of some basic ways to proceed. For example, you could do an all-by-all comparison of the short sequences, lay out which ones overlap and how, build a map of the overlaps, and try to build a tiling path that maximizes the connectivity of your map. Voila! Some approximation of the original sequences results! This approach is known as the overlap-layout-consensus approach, where at the end you produce a consensus view of the original sequence based on all the reads you have.

If you are a computer scientist or someone who programs for a living, you will also immediately recognize this as a rilly rilly hard problem! Forget biological peccadilloes; just doing this efficiently for large collections of sequences is computationally quite difficult. In particular, the all-by-all comparison is brutal: the number of comparisons scales as N**2 with the number of sequences N, so even if it's relatively efficient to compare two sequences, the problem behaves poorly as your data set grows. Plus, building a map of the overlaps is another hard problem: holding all that information in memory requires (yep!) O(N**2) memory, which is not cheap.

Is there any easy way to break down the problem? After all, big computers aren't cheap, but small computers are; so if you could split the problem into many smaller chunks, you could imagine using a grid or Beowulf approach, and just buying lots and lots of cheap hardware to scale.

Alas, the problem isn't easy to subdivide. It's easy to see why, if you think about the nature of the original sequences. Here's a little diagram; suppose, for example, that you have four subsequences all derived from one original sequence:

(orig) atggaccagatgagagcatgagccatggacggatcatggaaaacggttaaaaggggcatgg

(1)    atggaccagatgagagca
(2)                 gagcatgagccatggacggatc
(3)                                  ggatcatggaaaacggttaaaa
(4)                                                  ttaaaaggggcatgg

If the layout above is the only way that subsequences 1-4 overlap and can assemble, then to decompose the overlap problem across multiple computers would involve sending (1) and (2) to one computer, and (3) and (4) to another, assembling them there, and then taking the results and composing them on a shared node. Unfortunately, to do this efficiently currently requires that you know that 1 and 2 overlap, and that 3 and 4 overlap -- which is basically the problem that you already need to solve!

As I understand it -- I'm not a computer scientist unless you look at my letterhead -- there is simply no efficient way to decompose the overlap-layout-consensus assembly algorithm without either assuming something about the structure of the data, and/or introducing errors. (If you disagree, I'd appreciate either a reference or an implementation; thanks ;)

The second, or possibly third, generation of assemblers

OK, but computer scientists and computational biologists aren't dumb, and they like to tackle hard problems, and frankly this is an incredibly important problem to solve (for all sorts of reasons that you'll have to trust me on for now). Moreover, N^2 scaling is simply unacceptable!

Newer assemblers use a de Bruijn graph approach. Essentially, this involves breaking the subsequences down into fixed-length words of length k, and constructing an overlap graph. For example, taking the sequences above,:

(orig) atggaccagatgagagcatgagccatggacggatcatggaaaacggttaaaaggggcatgg

(1)    atggaccagatgagagca
(2)                 gagcatgagccatggacggatc
(3)                                  ggatcatggaaaacggttaaaa
(4)                                                  ttaaaaggggcatgg

you would break the original sequences down into words of length (say) 5, yielding:

atgga   gatga   catga   atgga   atcat   aaacg    aaagg
 tggac   atgag   atgag   tggac   tcatg   aacgg    aaggg
  ggacc   tgaga   tgagc   ggacg   catgg   acggt    agggg
   gacca   gagag   gagcc   gacgg   atgga   cggtt    ggggc
    accag   agagc   agcca   acgga   tggaa   ggtta    gggca
     ccaga   gagca   gccat   cggat   ggaaa   gttaa    ggcat
      cagat   agcat   ccatg   ggatc   gaaaa   ttaaa    gcatg
       agatg   gcatg   catgg   gatca   aaaac   taaaa    catgg
                                                aaaag

The overlaps between k-mers now implicitly give you a graph connecting each k-mer to all overlapping k-mers; and if you can find a path that traverses every node in this graph once, you will have your original contig.

Note that this actually works, although of course k must be much bigger than 5 in practice, and there are all sorts of cute tricks you must play to do a good job of disentangling complicated graphs.

Why is this an advantage over the overlap/layout/consensus approach that we looked at first? I'm not sure I've identified all the reasons, but there are at least two very important ones.

First, memory usage. While your memory usage for finding overlaps grows > O(N) with the overlap approach (with sparse matrices it should be N log N, I think?), the de Bruijn graph approach consumes only as much memory as you need to represent each new k-mer (so, with the number of novel k-mers) as well as the connections between them (which can be implicitly represented if you have efficient k-mer lookup). For large, deeply sequenced data sets this is going to be a huge savings: there are only three billion bases in the human genome, and probably only two billion unique k-mers of length 32 -- so if you can store k-mers efficiently (hint: you can) then the de Bruijn graph approach is really great.

Second, k-mers and k-mer overlaps can be stored and queried efficiently -- you just use a hash table or a trie structure. For example, you can store all 4**17 k-mers of length 17 as 34-bit offsets in a hash table (2 bits per DNA base), or you can use a branching trie structure to store arbitrarily long k-mers (see tallymer). Hash tables are be efficient (if big) representations for densely occupied k-mer spaces, while tries will be efficient for sparsely occupied k-mer spaces. Arbitrary length sequences are comparatively difficult to store and query.

The de Bruijn graph approach is what Velvet, ABySS, and SOAPdenovo use, and it seems to work well.

So what's the problem?

Scaling. Scaling is the problem.

Well, that and the sequencing companies and the biologists.

Let me explain. Sequencing companies are producing newer and bigger and better machines, that produce more and more sequence, every week. The Illumina GA2 produces 10-100 Gb of sequence per run now. The HiSeq 2000 is going to produce even more enormous amounts of sequence as soon as we get one. And more, lots more, is on the way.

This wouldn't be a problem if biologists would just stick to the exciting old problems, like resequencing humans and doing transcriptomes etc. But noooo, biologists see these juicy new sequencers and think -- hey! I could sequence populations of organisms! Or, like, 30 new organisms at once! Or 30 transcriptomes at once! And it will be cheap! (And we'll have someone else do the bioinformatics, which is easy, right? Right?)

So the sequencing companies are producing newer and cheaper and faster sequencing machines, and the biologists are using them to tackle ever more exciting and novel and challenging biological questions, and ... guess what? Our existing tools and approaches don't scale very well.

For one very specific example, the de Bruijn graph approach breaks down completely if you are sequencing endlessly diverse populations, as we seem to be doing in metagenomics. If you have some high abundance organisms, and a lot more low abundance organisms, and you sequence the organism soup to some arbitrary level, the novel k-mers will swamp your assembler, and to no end -- because those k-mers are never going to assemble to anything big without more sequencing. In which case you've compounded your swamping problem in an attempt to solve your earlier problem.

Similar things happen with wild population sequencing, where you get new and diverse sequences every time you look at a new animal; humans, even with their relatively low diversity, are one fine example.

OK, so this is the problem to solve, and I think it's a really big problem. It's not decomposable so it can't be made to scale well, and we're already at the limit of our existing compute infrastructure for the data we already have. (See Terabase metagenomics -- the computational side and grim future for sequencing centers.) And as we try to inch the boundaries along, the sequencing companies are producing new and bigger machines to give us new and bigger amounts of data.

Are there any solutions? No really good ones, unfortunately. The solution du jour (see MetaHIT methods and my earlier blog posts) is to throw away low abundance data that you figure won't assemble, and/or subdivide the sequences by abundance, in the expectation that similar abundance sequences will come from the same original genome. These are basically approximation heuristics, hoping to reduce the data in such a way that the assembler can deal with it. The hope is that the assembler can do a not-terribly-bad-job of assembly based on the known structure of the population.

Moreover, the throwing-away-data solution won't scale very well; soon enough you'll be throwing away not just 90% of the data, but 99% of the data, just to get a tractable data set.

We are doomed, doomed I say! Clearly we should give up.

Anyway, this concludes part one of a series of blog posts on assembly. In part two, I plan to talk a bit about paired-end sequencing and repeat sequences.

--titus

p.s. An excellent assembly algorithm reference: Miller, Koren, and Sutton, Genomics, 2010.

posted at: 15:07 | path: /aug-10 | 4 comments

Tags: , ,