Wed, 04 Jan 2012

What I'm REALLY thinking about when I use your bioinformatics software


If you're like me, we pretend to care about the science in bioinformatics software. But what we really do is try to find reasons not to outright loathe the software -- because, lud knows, there are usually plenty of reasons to hate it.

In no particular order, here are the top 10 things I hate about your bioinformatics software. You know who I'm talking about.

  1. You posted it on SourceForge (and so I can't download the damn thing using a simple URL).
  2. You're not using version control (and hence are not a scientist).
  3. You put _ in the damned file name unnecessarily (that requires a shift key on my keyboard).
  4. fizbam-0.9.3.tar.gz either untars into the current directory, OR a directory named 'fizbam'. Alternatively, you named it fizbam-0.9.3-2011010101010101010101010101020.tar.gz and it untars into THAT monster of a name (and my ls goes off the screen).
  5. You have no README, or, if you do, it's a one-liner that refers to a URL (didn't I already download your damned software?) OR 5a. Your README file is in HTML (less is better than lynx, dontcha know?)
  6. Running 'make' rebuilds everything from scratch every time you run it (seriously?)
  7. There are neither tests nor examples; or, if there are, I can't run 'em, and even if they do run, I have no idea if the results are correct.
  8. The output is in some weird format and/or location (wait, I have to do a find to find the last file written, and then guess as to its format?)
  9. The command line options are poorly labeled and described, use random abbreviations, and/or are sensitive to order (unlike every good command-line parsing library written).
  10. You CaMEl-CaSED your software name so that not even tab completion can figure it out (program names should be all lower-case, as Darwin intended).

---

Post your top ten and send me the URLs... :)

--titus

posted at: 21:57 | path: /jan-12 | 9 comments

Tags: , ,


Tue, 20 Dec 2011

Paper draft: Scaling metagenome sequence assembly with probabilistic de Bruijn graphs


(updated to point to http://arxiv.org/).

Authors: Jason Pell, Arend Hintze, Rosangela Canino-Koning, Adina Howe, James M. Tiedje, C. Titus Brown

Abstract:

The memory requirements for de novo assembly of short-read shotgun sequencing data from complex microbial populations are an increasingly large practical barrier to environmental studies. Here we introduce a memory-efficient graph representation with which we can analyze the k-mer connectivity of metagenomic samples, allowing us to reduce the size of the de novo assembly process for metagenomes with a "divide and conquer" algorithm. This graph representation is based on a probabilistic data structure, a Bloom filter, that allows us to store assembly graphs in as little as 4 bits per k-mer. We use this approach to achieve a 20-fold decrease in memory for the assembly of a soil metagenome sample.

The paper is available on arXiv here: http://arxiv.org/abs/1112.4193. Comments are welcome! We're planning to submit it to PNAS later this week.

I'll write a longer blog post about it soon.

--titus

posted at: 09:10 | path: /dec-11 | 0 comments

Tags: , , ,


Sun, 11 Dec 2011

Data Intensive Science, and Workflows


I'm writing this on my way back from Stockholm, where I attended a workshop on the 4th Paradigm. This is the idea (so named by Jim Gray, I gather?) that data-intensive science is a distinct paradigm from the first three paradigms of scientific investigation -- theory, experiment, and simulation. I was invited to attend as a token biologist -- someone in biology who works with large scale data, and thinks about large scale data, but isn't necessarily only devoted to dealing with large scale data.

The workshop was pretty interesting. It was run by Microsoft, who invited me & paid my way out there. The idea was to identify areas of opportunity in which Microsoft could make investments to help out scientists and develop new perspectives on the future of eScience. To do that, we played a game that I'll call the "anchor game", where we divvied up into groups to discuss the blocks to our work that stemmed from algorithms and tools, data, process and workflows, social/organizational aspects. In each group we put together sticky notes with our "complaints" and then ranked them by how big of an anchor they were on us -- "deep" sticky notes held us back more than shallow sticky notes. We then reorganized by discipline, and put together an end-to-end workflow that we hoped in 5 years would be possible, and then finally we looked for short- and medium-term projects that would help get us there.

The big surprise for me in all of this was that it turns out I'm most interested in workflows and process! All of my sticky notes had the same theme: it wasn't tools, or data management, or social aspects that were causing me problems, but rather the development of appropriate workflows for scientific investigation. Very weird, and not what I would have predicted from the outset.

Two questions came up for me during the workshop:

Why don't people use workflows in bioinformatics?

The first question that comes to mind is, why doesn't anyone I know use a formal workflow engine in bioinformatics? I know they exist (Taverna, for one, has a bunch of bioinformatics workflows); I'm reasonably sure they would be useful; but there seems to be some block against using them!

What would the ideal workflow situation be?

During the anchor game, our group (which consisted of two biologists, myself and Hugh Shanahan; a physicist, Geoffrey Fox; and a few computer scientists, including Alex Wade) came up with an idea for a specific tool. The tool would be a bridge between Datascope for biologists and a workflow engine. The essential idea is to combine data exploration with audit trail recording, which could then be hardened into a workflow template and re-used.

---

Thinking about the process I usually use when working on a new problem, it tends to consist of all these activites mixed together:

  1. evaluation of data quality, and application of "computational" controls
  2. exploration of various data manipulation steps, looking for statistical signal
  3. solidifying of various subcomponents of the data manipulation steps into scripted actions
  4. deployment of the entire thing on multiple data sets

Now, I'm never quite done -- new data types and questions always come up, and there are always components to tweak. But portions of the workflow become pretty solid by the time I'm halfway done with the initial project, and the evaluation of data quality accretes more steps but rarely loses one. So it could be quite useful to be able to take a step back and say, "yeah, these steps? wrap 'em up and put 'em into a workflow, I'm done." Except that I also want to be able to edit and change them in the future. And I'd like to be able to post the results along with the precise steps taken to generate them. And (as long as I'm greedy) I would like to work at the command line, while I know that others would like to be able to work in a notebook or graphical format. And I'd like to be able to "scale out" the computation by bringing the compute to the data.

For all of this I need three things: I need workflow agility, I need workflow versioning, and I need workflow tracking. And this all needs to sit on top of a workflow component model that lets me run the components of the workflow wherever the data is.

I'm guessing no tool out there does this, although I know other people are thinking this way, so maybe I'm wrong. The Microsoft folk didn't know of any, though, and they seemed pretty well informed in this area :).

The devil's choice I personally make in all of this is to go for workflow agility, and ignore versioning and tracking and the component model, by scripting the hell out of things. But this is getting old, and as I get older and have to teach my wicked ways to grad students and postdocs, the lack of versioning and tracking and easy scaling out gets more and more obvious. And now that I'm trying to actually teach computational science to biologists, it's simply appallingly difficult to convey this stuff in a sensible way. So I'm looking for something better.

One particularly intriguing type of software I've been looking at recently is the "interactive Web notebook" -- ipython notebook and sagenb. These are essentially Mathematica or matlab-style notebook packages that work with IPython or Sage, scientific computing and mathematical computing systems in Python. They let you run Python code interactively, and colocate it with its output (including graphics); the notebooks can then be saved and loaded and re-run. I'm thinking about working one or the other into my class, since it would let me move away from command-line dependence a bit. (Command lines are great, but throwing them at biologists, along with Python programming, data analysis, and program execution all together, seems a bit cruel. And not that productive.)

It would also be great to have cloud-enabled workflow components. As I embark on more and more sequence analysis, there are only about a dozen "things" we actually do, but mixed and matched. These things could be hardened, parameterized into components, and placed behind an RPC interface that would give us a standard way to execute them. Combined with a suitable data abstraction layer, I could run the components from location A on data in location B in a semi-transparent way, and also record and track their use in a variety of ways. Given a suitably rich set of components and use cases, I could imagine that these components and their interactions could be executed from multiple workflow engines, and with usefully interesting GUIs. I know Domain Specific Languages are already passe, but a DSL might be a good way to find the right division between subcomponents.

I'd be interested in hearing about such things that may already exist. I'm aware of Galaxy, but I think the components in Galaxy are not quite written at the right level of abstraction for me; Galaxy is also more focused on the GUI than I want. I don't know anything about Taverna, so I'm going to look into that a bit more. And, inevitably, we'll be writing some of our own software in this area, too.

Overall, I'm really interested in workflow approaches that let me transition seemlessly between discovery science and "firing for effect" for hypothesis-driven science.

A few more specific thoughts:

In the area of metagenomics (one of my research focuses at the moment), it would be great to see img/m, camera, and MG-RAST move towards a "broken out" workflow that lets semi-sophisticated computational users (hi mom!) run their stuff on the Amazon Cloud and on private HPCs or clouds. While I appreciate hosted services, there are many drawbacks to them, and I'd love to get my hands on the guts of those services. (I'm sure the MG-RAST folk would like me to note that they are moving towards making their pipeline more usable outside of Argonne: so noted.)

In the comments to my post on Four reasons I won't use your data analysis pipeline, Andrew Davison reminds me of VisTrails, which some people at the MS workshop recommended.

I met David De Roure from myExperiment at the MS workshop. To quote, "myExperiment makes it easy to find, use, and share scientific workflows and other Research Objects, and to build communities."

David put me in touch with Carole Goble who is involved with Taverna. Something to look at.

In the cloud computing workshop I organized at the Planet and Animal Genome conference this January, I will get a chance to buttonhole one of the Galaxy Cloud developers. I hope to make the most of this opportunity ;).

It'd be interesting to do some social science research on what difficulties users encounter when they attempt to use workflow engines. A lot of money goes into developing them, apparently, but at least in bioinformatics they are not widely used. Why? This is sort of in line with Greg Wilson's Software Carpentry and the wonderfully named blog It will never work in theory: rather than guessing randomly at what technical directions need to be pursued, why not study it empirically? It is increasingly obvious to me that improving computational science productivity has more to do with lowering learning barriers and changing other societal or cultural issues than with a simple lack of technology, and figuring out how (and if) appropriate technology could be integrated with the right incentives and teaching strategy is pretty important.

--titus

p.s. Special thanks to Kenji Takeda and Tony Hey for inviting me to the workshop, and especially for paying my way. 'twas really interesting!

posted at: 07:00 | path: /dec-11 | 8 comments

Tags: , , , ,


Tue, 06 Dec 2011

Four reasons I won't use your sequence analysis pipeline


(and some related thoughts on reproducibility in computational science)

In a recent news article on the "data deluge" in biology, I was quoted as saying "It's not at all clear what you do with that data. Doing a comprehensive analysis of it is essentially impossible at the moment." So, naturally, a company picked up on that. I received an e-mail saying, in part, "...we can provide solutions to solve all of your NGS analysis problems."

I do wonder a bit that they saw me cited as a "bioinformatics specialist" and then made the leap to assume that I was incompetent at providing my own solutions, but then again I do the same thing to others, so I guess I shouldn't complain. And the problems I'm talking about are, literally, unsolved research problems -- which leaves me a bit skeptical that a company I've never heard of will have a solution. (I offered to send them our 500 GB soil metagenome data set. No response yet.) But my gut reaction was actually even more visceral: it was "hell, no!" There's no way I'd use someone else's pipeline!

So without further ado, here are four reasons why there is no way in hell I will use your sequence analysis pipeline.

  1. Your (claimed) methods aren't open, or open source.

It's Not Science to use methods that aren't well understood or open to examination. Period.

Think CLC Workbench here -- they have some kind of assembly software that they sell. What does it do? No one is quite sure, other than that it's de Bruijn graph based, and rumor has it that it's a greedy algorithm. It's way faster and lighterweight in memory usage than some other solutions. And when we compare its performance on metagenomes to other approaches, we get very different answers. So I don't trust it worth a damn. YMMV. But:

It's not Bad Science to use methods you don't understand in detail, as long as you have done sufficient controls. It's definitely Bad Science to use methods you can't understand in detail.

  1. I don't know what your pipeline actually does, either.

Do you record the analysis steps and parameters for each analysis, from start to end? Do you keep track of the versions of software and scripts? Is everything under version control in the first place? And can I look at it and verify what was run? Can it be included in the published record?

It's Bad Science to not have an audit trail of your computational pipeline. It astonishes me that professors who would fire grad students for not keeping a lab notebook don't seem to pay any attention to all of the stuff that goes on in their computer system.

3. The methods and source data you're applying are not well suited to the basic problem.

For example, you're using a manifestly broken transcriptome to do mRNAseq analysis; or you're using default mapping parameters to do gene expression analysis.

It's definitely Bad Science to use off-the-shelf methods without thinking critically about them. (Did you know that apparently the Bowtie folk don't use their own default parameters? Huh.)

  1. You don't understand the biological intricacies of my model system.

For example, you're proposing to apply a standard resequencing mapping pipeline to metagenomic samples with significant unknowns, or with a lot of strain variation; or you're proposing to assemble a highly polymorphic population with an off-the-shelf diploid genome assembly package.

Fo' sure it's Bad Science to do Bad Science.

---

The above four reasons describe my general aversion to all forms of data analysis done on MY data from MY biology. If you want to know why I have no intention of using IMG/M or MG-RAST, for example -- well, there it is. (I can get more specific, just let me know :).

Unfortunately, those of you who are paying attention may have noticed the corner I just backed myself into. NOBODY does Data Science to my standards -- not even myself! I'm a total hypocrite. Oops.

I don't even know of any software packages that keep effective audit trails, for example; I've been looking for a while. (Cue the next corporate e-mail offering me AuditTrail2000... on Windows... sigh) We're good with the "open source stuff" (point 1), and we're feeling our way towards point 3 and point 4 on a bunch of projects, but it's really hard work to do a good job.

My main question is, how particular is all of this to bioinformatics and evolutionary biology, vs all of data-intensive science? Having just attended a workshop on this stuff, I'm skeptical that anyone is really doing an even passable job. I'm not even sure what solutions are on the horizon, at least in bioinformatics. Yikes.

One final note: while I was writing this up, I realized that these perhaps overly high standards may be sourced in my testing work, from software development. I know from firsthand experience just how context- dependent specific (and often very important) outputs can be from tracking down bugs in software; since computational science is at least as hard as software development, I infer the same is true there. Interesting.

--titus

posted at: 19:48 | path: /dec-11 | 1 comments

Tags:


Is Discovery Science Really Bogus?


This blog post was inspired by two recent events.

First, in response to a NY Times article about the "data deluge" affecting biologists, one of my Facebook friends said something like "stop whining about how hard it is to analyze the data and do some good experiments instead!" I vehemently disagreed with this simple statement -- but why??

Second, I used the 4th domain paper by Jonathan Eisen in my computational science class, and we discussed how one would reject or accept the 4th domain with more confidence. Somewhat to my surprise, my own conclusion was that I would ... sequence everything! Yep, just go out and sequence everything I could get my hands on in the tree of life, as well as a bunch of communities from ocean and soil. I was surprised to reach this conclusion (which we can debate on its own merits some other time) because my background is in real science, not "discovery science", and I'd been trained to believe that the discovery-based approach of shaking the trees to see what falls out was kind of unintellectual and unscientific.

Both of these events made me rethink my attitude towards discovery science. The first, because the guy that told us all to stop whining isn't dumb, but I also don't think he's entirely or even mostly right; and the second, because, together with the first, it made me challenge the conventional wisdom in molecular biology that hypothesis driven science is the Right Way.

Hypothesis-driven biology

The way many (most? all?) molecular biologists work is something like this: they develop a theory about some process (physiological or developmental or genomic or whatnot), develop a specific hypothesis or set of hypotheses, and then figure out how to test those hypotheses using controlled experiments. "Hypothesis: objects of near equal mass accelerate equally in a uniform gravitational field; test: drop two objects of equal mass; control for wind resistance." In developmental biology, the molecular field with which I'm most familiar, you might say "I think that the pax3/7 gene is necessary for neural crest specification in these cells at this time, so I'm going to knock it down and see what happens to neural crest." The key point is that you always need to reframe your theory in the form of a fairly specific hypothesis, and then figure out a way to test it. Training students to develop, frame, and test hypotheses is What We Do as professors. When you write grant proposals, you write about why you have developed a specific set of hypotheses (that is, you justify your hypotheses by appealing to prior work and preliminary results), claim that these hypotheses are important or interesting, and then argue vigorously that you are the right person to receive beaucoup bucks to test these hypotheses. Hypothesis-driven research is what we do!

This somewhat dogmatic picture obscures a number of inconvenient truths, however. First of all, many grant agencies (and reviewers) are risk averse, so they prefer to fund things that appear as certain as possible. This means you have to walk the line between crippling your hypotheses by predetermining them with your data, and coming up with an interesting and novel hypothesis -- if you've already tested your hypothesis and you're pretty sure it's right, then it's no longer that interesting to test! Second, no research plan survives contact with reality. So what you really do is sketch out a small extension to a near-certain hypothesis, get funded (admittedly this step is rather rare...), and then discover that your extension is incredibly simplistic and most likely wrong and a dead-end alley. So you end up working on something else completely. That is, you get the grant to work on X and end up working on Z -- not necessarily too far away from X, but not X, either. This leads to a third truth, which is that you get grants because you've been able to make a successful argument, not because anyone expects you to accomplish exactly what is in the grant. The only people that really take your grant proposal literally are the contracts & grants people at your university; the grant administrator and you both understand that this is research, and a real researcher is likely to end up someplace other than where you intended to go, at least in detail.

(I always like to cite Einstein at this point in a conversation: "If we knew what we were doing, it wouldn't be called research, would it?")

What happens when a student confronts this situation? Well, usually students have to write research proposals as part of their qualifying exams, and often students try to stick to those research proposals even when their experiments go awry. I've been part of a bunch of committees where the student will say "ok, so these were our original aims, and here's where I've gone away from them". They don't seem to understand that we don't care (or at least I don't): the real point of the qual is to make sure they know how to frame a hypothesis, and to ensure that they know what a testable hypothesis looks like, smells like, feels like, and tastes like. After that, your research will go where it goes, and that's as it should be. (Aside: my most frustrating (but still positive) moment as a committee member occurred when a student presented a hypothesis and talked about how she was going to test that hypothesis with method X, method Y, data analysis Z, etc. We asked her a bunch of questions and she seemed strangely confident and specific about the expected results. Upon further probing, it turned out that she'd already done the experiments and knew the answers, but thought that the qual needed to be about her hypotheses de novo, and shouldn't take into account actual data she'd generated. WTF??)

Unfortunately, we often stick students with projects where there is no honest way to frame a specific hypothesis. This is true of young labs, which may not have enough specific data to develop a good testable hypothesis for their system and are still casting about for a specific direction to take; and it's increasingly true of established labs that are using next-gen sequencing.

Cue next story: a student of mine was (and still is) part of a collaboration where we were doing bioinformatic analysis of genome-scale disease data. The other professor had funding and generated the data, which was basically sequencing RNA from an affected organ. There literally was no specific hypothesis other than "let's go see how this disease is affecting the spleen transcriptional response." This was then given to my student, who happily pounded away at the data for a few months (making many more trenchant observations about mRNAseq than about the disease, but nonetheless making progress). It came time for his committee meeting, and his committee insisted that he present a hypothesis. He cast about for a while, and finally come up with "there will be a differential transcriptional response to this disease in the spleen." This was nearly disastrous, of course, because it's simply not very specific! Sure, it's a hypothesis, and it's almost certainly true, but it's not specific enough to be useful. So my student nearly failed his committee meeting (note that I was a young(er) prof at this point, and hadn't seen this coming; my fault!) Why am I telling this story, though? Because my collaborator, who had generated the data in a hypothesis free manner, was a member of the committee, and was very disturbed by my student's lack of a hypothesis. Why? Because it was considered very important that our students be doing hypothesis driven science, even though neither he nor I had directed the project that way!

Before I continue on to draw a lesson from this, let me say: I'm not anti-hypothesis in any way. The student is, eventually, going to have to develop a hyp, or he isn't going to get his PhD; he knows that, and I know that. But we were still working on generating hypotheses from the data, and didn't have them ready at hand; developing the hypotheses was actually the first, very significant component of the project. Another point is that the committee was completely unprepared for this. And a third was that the guy who generated the data was so wedded to this hypothesis-driven approach that he basically ended up being hypocritical -- which I point out to him regularly :). (Another component that actually played a smaller part than I'd feared was the computational nature of the research: a certain subset of molecular biologists will vehemently deny that useful work can be done without a pipette man in hand. This either leads to ineffective one-handed typing in bioinformatics, or vociferous arguments among professors -- neither good for committee meetings.)

The lesson I want to draw from this anecdote is simple: hypothesis-driven science is dead!

No, no, not at all. More seriously, I think that as data generation becomes easier in some fields of biology, we should recognize that an extended period of hypothesis generation through discovery-driven approaches may be useful and necessary for many projects. Many biologists may not be any good at this, because they've been honed for decades to focus on moving as quickly as possible to a hypothesis based on a relatively small amount of hand-curated data; but in practice, hypotheses are now cheap (because data is plentiful) and I think we should focus on developing likely hypotheses and winnowing out the dumb 'uns computationally before we ever pick up a pipette man to test 'em. That is, expand the hypothesis-generation and analysis stages so that we're more likely to develop a comprehensive and interesting hypothesis.

About Models, and Model Systems

One of the limitations of the drive to proximal hypotheses is that you need to have tractable systems -- systems in which you can relatively quickly and easily test hypotheses. This leads to using models, and model systems. For example, Drosophila is a great model for genetics and development: it's been used for decades, and has led to at least one set of Nobel prizes for basic understanding of genetics. You can do lots and lots of things with it way more easily than you could imagine doing those same things in a mammalian system: mutagenize, resequence the genome from scratch, do all sorts of crosses in what appear to be a few weeks, etc. etc. But, whether you're interested in biomedical applications, or you're interested in population genetics, or whatnot, it's still just a model, and to build a connection to the broader set of science, you need to analogize the model in various ways. The bigger the field around the model system gets, the less the people feel the need to make the model explicit, and then the junior people forget about it. And so sometimes the model just doesn't apply. One of my favorite examples (just to pick on Drosophila and C. elegans, which are the two biggest invertebrate animal model systems) is from the early days of genomics. We sequenced mouse, and human, and Drosophila, and C. elegans, and saw that there were about 30% more types of genes and gene families in vertebrates. This led to a certain amount of breathless discussion about "the genes that made us vertebrates". Then we sequenced hydra (most emphatically not a vertebrate!), and discovered that it had almost all those gene families. Bang! It turned out that Drosophila and C. elegans were members of a monophyletic group, the Ecdysozoa, which had undergone extensive gene loss! So in some ways, Drosophila and C. elegans are really bad models for vertebrate genomics! They're from a relatively distant branch of the animals, they have small genomes partly because they were chosen for rapid breeding, and there are lots of things that are just different about them. They're still awesome, and they deserve a lot of study, but the history of genetic research on them really shows both the pluses and minuses of model systems: sometimes a model system that's great for one reason is horrible for another.

The same thing happens in ecology and population genetics, it seems to me. There's a lot of mathematical models that are simple and tractable and that let you "test hypotheses" about certain kinds of relationships, but then you have to determine how relevant those models are to reality. People would prefer not to spend that kind of time or effort -- because it's time and effort not spent generating and testing hypotheses. So the connection is made only for a few kinds of systems, which limits the vision of people doing research.

What about cancer?

I think another catalyst that made me think about all of this is the book The Emperor of Maladies, a Pulitzer-prize winning biography of cancer. There you see again and again how hypothesis driven approaches basically failed, while we slowly developed diagnostic tools and (frankly) guessed randomly about how to deal with cancer. Only recently have we started to gain an understanding of exactly what's going on at the genomic and genetic level, but it's still slow to make its way into therapeutic use; chemo -- killing the cancer slightly more quickly than the normal cells -- is still the main treatment, for chrissakes. Do you think we would do that if we had any other option?? Reading the book, the guy who developed the Pap smear (an excellent diagnostic for cervical cancer) did so on guinea pigs, because it was the only way he could detect estrus in guinea pigs -- by scraping the cervix. He spent 20 years trying to find a biomedical use for it! That's not hypothesis-driven science. Epidemiology has probably had a greater effect on cancer treatment than anything else, by tracking down the specific causes of various conditions like lung cancer, long before we were thinking about cellular mutations.

In my class the other day, the one where we talked about the 4th domain work, James Foster from U Idaho made the point that observation in biology used to be called "Natural History". One of the greatest successes of Natural History? Evolution, the greatest explanatory theory in biology, came directly from the synthesis of vast amounts of observation, with no experiment involved. It took decades for Darwin and others to put it together, and decades more for it to be validated in a hypothesis-driven framework (I'm thinking the finches, or the Lenski E. coli experiment, here; there are probably better places to cite that I don't know about).

The Molgula

When my Facebook friend & colleague talked about how we should stop bitching about data processing and start thinking about experiments, I'm pretty sure he meant that people should be better hypothesis-driven scientists. My instinctive reaction to that thesis is that he's not right (nor is he entirely wrong -- hypothesis-driven science is still necessary, just not sufficient!)

One of my current projects is working on a group of sea squirts, the Molgula, that underwent a dramatic morphological change in the larval form: many of the larvae lack tails. We want to know, how did this happen?

To address this question, we went out and generated about 600 million reads of mRNAseq from a variety of larval stages for a tailed sea squirt, a tailless sea squirt, and hybrid crosses between them. This has let us ask which genes are present, what their levels of expression are, and whether there is allele-specific expression of certain genes in one species over another (never mind, just trust me, it's important & interesting to know). In order to analyze this data - which amounts to about 80 GB of DNA, compressed -- we've had to invent a whole new series of data analysis and reduction tools. This is because the Molgula aren't well-studied model systems: they don't have genome sequences available, no large scale cDNA projects have been done on them, and the molecular tools for doing basic probes are still thin on the ground. It was far easier to spend $20k on sequencing and get an answer in a matter of months -- even counting the development of the data analysis tools -- than it was to do anything else.

Are we going to now go out and take our high-throughput data and analyze it and conclude, voila, we know why the tails aren't forming? No, we're not that dumb! But we are developing several early hypotheses based on the data we have, and we're checking to see if they're plausible in the face of tissue-specific gene expression assays (WMISH). Then we'll go and do the hypothesis-driven perturbations to see: is the tail being specified and failing to extend? Or is it not being specified at all?

It's worth pointing out that virtually everything known about tail development in the sea squirts comes from one particular species, Ciona intestinalis, which is now a pretty established model system: genome, database, EST projects, a whole community. The Molgula, however, which look morphologically pretty similar, are about as far away from Ciona (evolutionarily speaking) as you can get and still be a sea squirt. Wouldn't it be fascinating to know how tails develop in them? Well, if we hadn't lucked into some excellent seed funding for the Molgula project and been able to generate and analyze the vast amounts of sequence, we wouldn't be on our way to looking at them -- this kind of study is seen as a fishing expedition, not worthy of being funded.

This is really the problem with hypothesis-driven approaches, and the priority we give them: they focus us on the questions that can be answered fairly quickly and easily, and not necessarily on the big questions. Sometimes it's possible to find a fundable route to those big questions; sometimes not. In the latter case, the questions go unaddressed.

Soil metagenomes

The other big-ass data project I'll bring up is the Great Prairie Grand Challenge, in which the DOE JGI is sequencing literally terabases worth of DNA extracted from midwestern soil. The ultimate goal is to understand the microbial community composition and function.

Do we have any idea how to do that?

Well, the answer is, "not really". The field of metagenomics is still young, and it turns out to be technologically blocked. That is, the diversity of soil is so high that you need to sample it really deeply; but then the depth of sampling yields so much data, that you can't do anything clever with it computationally. This is one of the other focuses of my lab, and it's emphatically a long-term discovery-driven project. We have only a little idea of what we're looking for, and it's likely to be unrecognizable on the first four looks. We'll have to look and think deeply, AFTER solving the data analysis problems (which, again, I think we have. But it was really hard :).

Rumsfeldian science

One of my other favorite citations is that great Rumsfeld quote, about the known knowns, known unknowns, and unknown unknowns (in his case, with respect to invading Iraq -- oops). We know so little about biology that to restrict our gaze to the known knowns, or even to the known unknowns, is foolish.

Look again at this evolutionary tree of life, from Norm Pace's lab. We understand virtually nothing about the vast majority of those organisms. Sure, we can start to get at the commonalities of some aspects of protein composition, cellular organization, and genomics. But who knows what's out there? Certainly not me, and I suspect no one else. We have a long way to go.

To return to the original purpose of this rant, a lot of this "known unknown" and even more of this "unknown unknown" stuff involves looking at vast amounts of data and finding clever ways to grok the structure of the data, filter out stuff we think is uninteresting, and cherry pick the stuff that IS interesting. This is one of my focuses, and it is hard, specialized, time consuming, and wonderfully challenging. To hear other scientists say, dismissively, that we need to learn how to do proper experiments is a bit disheartening, and, even more problematically, rather short-sighted for the field.

Data -- especially the vast amounts of next-gen data starting to come from sequencers -- is usefully "hypothesis neutral". In Timo Hanny's defense of Chris Anderson's theory that "hypotheses are dead" in The 4th Paradigm, he pointed out that surely there is some point where "more" is different from "some". Being able to sensitively look at minor members of communities, or low-expressed genes and isoforms, will inevitably be informative; we shouldn't just discard it as "that useless discovery science stuff".

In conclusion:

A key part of doing good hypothesis driven science is to come up with good hypotheses based on large-scale observations of biological systems. We should respect that initial stage of observing more than we seem to. My graduate advisor, Eric Davidson, told me the famous analogy about scientific practice being similar to a drunk, having dropped his keys in a dark alleyway, looking for them under the street light; while some people spend their career carrying flashlights into dark corners and doing a really detailed search, and others work on the flashlights, I think it's also going to fruitful to turn up the wattage on the street light so that all of those dark corners get illuminated. And we'll need sharp eyes to search all that newly lit territory. DNA sequencing is turning up the wattage; let's develop the methods to find the nifty stuff that we can now see.

--titus

posted at: 17:25 | path: /dec-11 | 5 comments

Tags: , ,