Wed, 04 Jan 2012

What I'm REALLY thinking about when I use your bioinformatics software


If you're like me, you pretend to care about the science in bioinformatics software. But what you're really doing is trying to find reasons not to outright loathe the software -- because, lud knows, there are usually plenty of reasons to hate it.

In no particular order, here are the top 10 things I hate about your bioinformatics software. You know who I'm talking about.

  1. You posted it on SourceForge (and so I can't download the damn thing using a simple URL).
  2. You're not using version control (and hence are not a scientist).
  3. You put _ in the damned file name unnecessarily (that requires a shift key on my keyboard).
  4. fizbam-0.9.3.tar.gz either untars into the current directory, OR a directory named 'fizbam'. Alternatively, you named it fizbam-0.9.3-2011010101010101010101010101020.tar.gz and it untars into THAT monster of a name (and my ls goes off the screen).
  5. You have no README, or, if you do, it's a one-liner that refers to a URL (didn't I already download your damned software?) OR 5a. Your README file is in HTML (less is better than lynx, dontcha know?)
  6. Running 'make' rebuilds everything from scratch every time you run it (seriously?)
  7. There are neither tests nor examples; or, if there are, I can't run 'em, and even if they do run, I have no idea if the results are correct.
  8. The output is in some weird format and/or location (wait, I have to do a find to find the last file written, and then guess as to its format?)
  9. The command line options are poorly labeled and described, use random abbreviations, and/or are sensitive to order (unlike every good command-line parsing library written).
  10. You CaMEl-CaSED your software name so that not even tab completion can figure it out (program names should be all lower-case, as Darwin intended).

---

Post your top ten and send me the URLs... :)

--titus

posted at: 21:57 | path: /jan-12 | 9 comments



Tue, 06 Dec 2011

Is Discovery Science Really Bogus?


This blog post was inspired by two recent events.

First, in response to a NY Times article about the "data deluge" affecting biologists, one of my Facebook friends said something like "stop whining about how hard it is to analyze the data and do some good experiments instead!" I vehemently disagreed with this simple statement -- but why??

Second, I used the 4th domain paper by Jonathan Eisen in my computational science class, and we discussed how one would reject or accept the 4th domain with more confidence. Somewhat to my surprise, my own conclusion was that I would ... sequence everything! Yep, just go out and sequence everything I could get my hands on in the tree of life, as well as a bunch of communities from ocean and soil. I was surprised to reach this conclusion (which we can debate on its own merits some other time) because my background is in real science, not "discovery science", and I'd been trained to believe that the discovery-based approach of shaking the trees to see what falls out was kind of unintellectual and unscientific.

Both of these events made me rethink my attitude towards discovery science. The first, because the guy that told us all to stop whining isn't dumb, but I also don't think he's entirely or even mostly right; and the second, because, together with the first, it made me challenge the conventional wisdom in molecular biology that hypothesis driven science is the Right Way.

Hypothesis-driven biology

The way many (most? all?) molecular biologists work is something like this: they develop a theory about some process (physiological or developmental or genomic or whatnot), develop a specific hypothesis or set of hypotheses, and then figure out how to test those hypotheses using controlled experiments. "Hypothesis: objects of near equal mass accelerate equally in a uniform gravitational field; test: drop two objects of equal mass; control for wind resistance." In developmental biology, the molecular field with which I'm most familiar, you might say "I think that the pax3/7 gene is necessary for neural crest specification in these cells at this time, so I'm going to knock it down and see what happens to neural crest." The key point is that you always need to reframe your theory in the form of a fairly specific hypothesis, and then figure out a way to test it. Training students to develop, frame, and test hypotheses is What We Do as professors. When you write grant proposals, you write about why you have developed a specific set of hypotheses (that is, you justify your hypotheses by appealing to prior work and preliminary results), claim that these hypotheses are important or interesting, and then argue vigorously that you are the right person to receive beaucoup bucks to test these hypotheses. Hypothesis-driven research is what we do!

This somewhat dogmatic picture obscures a number of inconvenient truths, however. First of all, many grant agencies (and reviewers) are risk averse, so they prefer to fund things that appear as certain as possible. This means you have to walk the line between crippling your hypotheses by predetermining them with your data, and coming up with an interesting and novel hypothesis -- if you've already tested your hypothesis and you're pretty sure it's right, then it's no longer that interesting to test! Second, no research plan survives contact with reality. So what you really do is sketch out a small extension to a near-certain hypothesis, get funded (admittedly this step is rather rare...), and then discover that your extension is incredibly simplistic and most likely wrong and a dead-end alley. So you end up working on something else completely. That is, you get the grant to work on X and end up working on Z -- not necessarily too far away from X, but not X, either. This leads to a third truth, which is that you get grants because you've been able to make a successful argument, not because anyone expects you to accomplish exactly what is in the grant. The only people that really take your grant proposal literally are the contracts & grants people at your university; the grant administrator and you both understand that this is research, and a real researcher is likely to end up someplace other than where you intended to go, at least in detail.

(I always like to cite Einstein at this point in a conversation: "If we knew what we were doing, it wouldn't be called research, would it?")

What happens when a student confronts this situation? Well, usually students have to write research proposals as part of their qualifying exams, and often students try to stick to those research proposals even when their experiments go awry. I've been part of a bunch of committees where the student will say "ok, so these were our original aims, and here's where I've gone away from them". They don't seem to understand that we don't care (or at least I don't): the real point of the qual is to make sure they know how to frame a hypothesis, and to ensure that they know what a testable hypothesis looks like, smells like, feels like, and tastes like. After that, your research will go where it goes, and that's as it should be. (Aside: my most frustrating (but still positive) moment as a committee member occurred when a student presented a hypothesis and talked about how she was going to test that hypothesis with method X, method Y, data analysis Z, etc. We asked her a bunch of questions and she seemed strangely confident and specific about the expected results. Upon further probing, it turned out that she'd already done the experiments and knew the answers, but thought that the qual needed to be about her hypotheses de novo, and shouldn't take into account actual data she'd generated. WTF??)

Unfortunately, we often stick students with projects where there is no honest way to frame a specific hypothesis. This is true of young labs, which may not have enough specific data to develop a good testable hypothesis for their system and are still casting about for a specific direction to take; and it's increasingly true of established labs that are using next-gen sequencing.

Cue next story: a student of mine was (and still is) part of a collaboration where we were doing bioinformatic analysis of genome-scale disease data. The other professor had funding and generated the data, which was basically sequencing RNA from an affected organ. There literally was no specific hypothesis other than "let's go see how this disease is affecting the spleen transcriptional response." This was then given to my student, who happily pounded away at the data for a few months (making many more trenchant observations about mRNAseq than about the disease, but nonetheless making progress). It came time for his committee meeting, and his committee insisted that he present a hypothesis. He cast about for a while, and finally came up with "there will be a differential transcriptional response to this disease in the spleen." This was nearly disastrous, of course, because it's simply not very specific! Sure, it's a hypothesis, and it's almost certainly true, but it's not specific enough to be useful. So my student nearly failed his committee meeting (note that I was a young(er) prof at this point, and hadn't seen this coming; my fault!). Why am I telling this story, though? Because my collaborator, who had generated the data in a hypothesis-free manner, was a member of the committee, and was very disturbed by my student's lack of a hypothesis. Why? Because it was considered very important that our students be doing hypothesis-driven science, even though neither he nor I had directed the project that way!

Before I continue on to draw a lesson from this, let me say: I'm not anti-hypothesis in any way. The student is, eventually, going to have to develop a hypothesis, or he isn't going to get his PhD; he knows that, and I know that. But we were still working on generating hypotheses from the data, and didn't have them ready at hand; developing the hypotheses was actually the first, very significant component of the project. Another point is that the committee was completely unprepared for this. And a third was that the guy who generated the data was so wedded to this hypothesis-driven approach that he basically ended up being hypocritical -- which I point out to him regularly :). (Another component that actually played a smaller part than I'd feared was the computational nature of the research: a certain subset of molecular biologists will vehemently deny that useful work can be done without a pipette man in hand. This either leads to ineffective one-handed typing in bioinformatics, or vociferous arguments among professors -- neither good for committee meetings.)

The lesson I want to draw from this anecdote is simple: hypothesis-driven science is dead!

No, no, not at all. More seriously, I think that as data generation becomes easier in some fields of biology, we should recognize that an extended period of hypothesis generation through discovery-driven approaches may be useful and necessary for many projects. Many biologists may not be any good at this, because they've been honed for decades to focus on moving as quickly as possible to a hypothesis based on a relatively small amount of hand-curated data; but in practice, hypotheses are now cheap (because data is plentiful) and I think we should focus on developing likely hypotheses and winnowing out the dumb 'uns computationally before we ever pick up a pipette man to test 'em. That is, expand the hypothesis-generation and analysis stages so that we're more likely to develop a comprehensive and interesting hypothesis.

About Models, and Model Systems

One of the limitations of the drive to proximal hypotheses is that you need to have tractable systems -- systems in which you can relatively quickly and easily test hypotheses. This leads to using models, and model systems. For example, Drosophila is a great model for genetics and development: it's been used for decades, and has led to at least one set of Nobel prizes for basic understanding of genetics. You can do lots and lots of things with it way more easily than you could imagine doing those same things in a mammalian system: mutagenize, resequence the genome from scratch, do all sorts of crosses in what appear to be a few weeks, etc. etc. But, whether you're interested in biomedical applications, or you're interested in population genetics, or whatnot, it's still just a model, and to build a connection to the broader set of science, you need to analogize the model in various ways. The bigger the field around the model system gets, the less the people feel the need to make the model explicit, and then the junior people forget about it. And so sometimes the model just doesn't apply. One of my favorite examples (just to pick on Drosophila and C. elegans, which are the two biggest invertebrate animal model systems) is from the early days of genomics. We sequenced mouse, and human, and Drosophila, and C. elegans, and saw that there were about 30% more types of genes and gene families in vertebrates. This led to a certain amount of breathless discussion about "the genes that made us vertebrates". Then we sequenced hydra (most emphatically not a vertebrate!), and discovered that it had almost all those gene families. Bang! It turned out that Drosophila and C. elegans were members of a monophyletic group, the Ecdysozoa, which had undergone extensive gene loss! So in some ways, Drosophila and C. elegans are really bad models for vertebrate genomics! 
They're from a relatively distant branch of the animals, they have small genomes partly because they were chosen for rapid breeding, and there are lots of things that are just different about them. They're still awesome, and they deserve a lot of study, but the history of genetic research on them really shows both the pluses and minuses of model systems: sometimes a model system that's great for one reason is horrible for another.

The same thing happens in ecology and population genetics, it seems to me. There's a lot of mathematical models that are simple and tractable and that let you "test hypotheses" about certain kinds of relationships, but then you have to determine how relevant those models are to reality. People would prefer not to spend that kind of time or effort -- because it's time and effort not spent generating and testing hypotheses. So the connection is made only for a few kinds of systems, which limits the vision of people doing research.

What about cancer?

I think another catalyst that made me think about all of this is the book The Emperor of All Maladies, a Pulitzer Prize-winning biography of cancer. There you see again and again how hypothesis-driven approaches basically failed, while we slowly developed diagnostic tools and (frankly) guessed randomly about how to deal with cancer. Only recently have we started to gain an understanding of exactly what's going on at the genomic and genetic level, but it's still slow to make its way into therapeutic use; chemo -- killing the cancer slightly more quickly than the normal cells -- is still the main treatment, for chrissakes. Do you think we would do that if we had any other option?? According to the book, the guy who developed the Pap smear (an excellent diagnostic for cervical cancer) did so on guinea pigs, because it was the only way he could detect estrus in guinea pigs -- by scraping the cervix. He spent 20 years trying to find a biomedical use for it! That's not hypothesis-driven science. Epidemiology has probably had a greater effect on cancer treatment than anything else, by tracking down the specific causes of various conditions like lung cancer, long before we were thinking about cellular mutations.

In my class the other day, the one where we talked about the 4th domain work, James Foster from U Idaho made the point that observation in biology used to be called "Natural History". One of the greatest successes of Natural History? Evolution, the greatest explanatory theory in biology, came directly from the synthesis of vast amounts of observation, with no experiment involved. It took decades for Darwin and others to put it together, and decades more for it to be validated in a hypothesis-driven framework (I'm thinking the finches, or the Lenski E. coli experiment, here; there are probably better places to cite that I don't know about).

The Molgula

When my Facebook friend & colleague talked about how we should stop bitching about data processing and start thinking about experiments, I'm pretty sure he meant that people should be better hypothesis-driven scientists. My instinctive reaction to that thesis is that he's not right (nor is he entirely wrong -- hypothesis-driven science is still necessary, just not sufficient!)

One of my current projects is working on a group of sea squirts, the Molgula, that underwent a dramatic morphological change in the larval form: many of the larvae lack tails. We want to know, how did this happen?

To address this question, we went out and generated about 600 million reads of mRNAseq from a variety of larval stages for a tailed sea squirt, a tailless sea squirt, and hybrid crosses between them. This has let us ask which genes are present, what their levels of expression are, and whether there is allele-specific expression of certain genes in one species over another (never mind, just trust me, it's important & interesting to know). In order to analyze this data -- which amounts to about 80 GB of DNA, compressed -- we've had to invent a whole new series of data analysis and reduction tools. This is because the Molgula aren't well-studied model systems: they don't have genome sequences available, no large scale cDNA projects have been done on them, and the molecular tools for doing basic probes are still thin on the ground. It was far easier to spend $20k on sequencing and get an answer in a matter of months -- even counting the development of the data analysis tools -- than it was to do anything else.

Are we going to now go out and take our high-throughput data and analyze it and conclude, voila, we know why the tails aren't forming? No, we're not that dumb! But we are developing several early hypotheses based on the data we have, and we're checking to see if they're plausible in the face of tissue-specific gene expression assays (WMISH). Then we'll go and do the hypothesis-driven perturbations to see: is the tail being specified and failing to extend? Or is it not being specified at all?

It's worth pointing out that virtually everything known about tail development in the sea squirts comes from one particular species, Ciona intestinalis, which is now a pretty established model system: genome, database, EST projects, a whole community. The Molgula, however, which look morphologically pretty similar, are about as far away from Ciona (evolutionarily speaking) as you can get and still be a sea squirt. Wouldn't it be fascinating to know how tails develop in them? Well, if we hadn't lucked into some excellent seed funding for the Molgula project and been able to generate and analyze the vast amounts of sequence, we wouldn't be on our way to looking at them -- this kind of study is seen as a fishing expedition, not worthy of being funded.

This is really the problem with hypothesis-driven approaches, and the priority we give them: they focus us on the questions that can be answered fairly quickly and easily, and not necessarily on the big questions. Sometimes it's possible to find a fundable route to those big questions; sometimes not. In the latter case, the questions go unaddressed.

Soil metagenomes

The other big-ass data project I'll bring up is the Great Prairie Grand Challenge, in which the DOE JGI is sequencing literally terabases worth of DNA extracted from midwestern soil. The ultimate goal is to understand the microbial community composition and function.

Do we have any idea how to do that?

Well, the answer is, "not really". The field of metagenomics is still young, and it turns out to be technologically blocked. That is, the diversity of soil is so high that you need to sample it really deeply; but then the depth of sampling yields so much data, that you can't do anything clever with it computationally. This is one of the other focuses of my lab, and it's emphatically a long-term discovery-driven project. We have only a little idea of what we're looking for, and it's likely to be unrecognizable on the first four looks. We'll have to look and think deeply, AFTER solving the data analysis problems (which, again, I think we have. But it was really hard :).

Rumsfeldian science

One of my other favorite citations is that great Rumsfeld quote, about the known knowns, known unknowns, and unknown unknowns (in his case, with respect to invading Iraq -- oops). We know so little about biology that to restrict our gaze to the known knowns, or even to the known unknowns, is foolish.

Look again at this evolutionary tree of life, from Norm Pace's lab. We understand virtually nothing about the vast majority of those organisms. Sure, we can start to get at the commonalities of some aspects of protein composition, cellular organization, and genomics. But who knows what's out there? Certainly not me, and I suspect no one else. We have a long way to go.

To return to the original purpose of this rant, a lot of this "known unknown" and even more of this "unknown unknown" stuff involves looking at vast amounts of data and finding clever ways to grok the structure of the data, filter out stuff we think is uninteresting, and cherry pick the stuff that IS interesting. This is one of my focuses, and it is hard, specialized, time consuming, and wonderfully challenging. To hear other scientists say, dismissively, that we need to learn how to do proper experiments is a bit disheartening, and, even more problematically, rather short-sighted for the field.

Data -- especially the vast amounts of next-gen data starting to come from sequencers -- is usefully "hypothesis neutral". In Timo Hannay's defense of Chris Anderson's theory that "hypotheses are dead" in The 4th Paradigm, he pointed out that surely there is some point where "more" is different from "some". Being able to sensitively look at minor members of communities, or low-expressed genes and isoforms, will inevitably be informative; we shouldn't just discard it as "that useless discovery science stuff".

In conclusion:

A key part of doing good hypothesis-driven science is to come up with good hypotheses based on large-scale observations of biological systems. We should respect that initial stage of observing more than we seem to. My graduate advisor, Eric Davidson, told me the famous analogy about scientific practice being similar to a drunk, having dropped his keys in a dark alleyway, looking for them under the street light; while some people spend their career carrying flashlights into dark corners and doing a really detailed search, and others work on the flashlights, I think it's also going to be fruitful to turn up the wattage on the street light so that all of those dark corners get illuminated. And we'll need sharp eyes to search all that newly lit territory. DNA sequencing is turning up the wattage; let's develop the methods to find the nifty stuff that we can now see.

--titus

posted at: 17:25 | path: /dec-11 | 5 comments



Fri, 11 Mar 2011

My new data analysis pipeline code


First, I write a recipe file, 'metagenome.recipe', laying out my job description for, say, sequence trimming and assembly with Velvet:

fasta_file soil-data.fa

qc_filter min_length=50 remove_Ns=true

graph_filter min_length=400

velvet_assemble k=33 min_length=1000 scaffolding=True
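Since none of this exists yet, here is a minimal Python sketch of how a recipe like the one above might be read into a list of steps. Everything in it is an assumption for illustration: the Step tuple, the parse_recipe function, the rule that bare tokens are positional arguments while key=value tokens are options, and the '#' comment convention are all made up here, and values are left as strings rather than guessing at types.

```python
from typing import NamedTuple


class Step(NamedTuple):
    name: str      # step name, e.g. "qc_filter"
    args: list     # positional arguments, e.g. ["soil-data.fa"]
    params: dict   # key=value options, kept as strings


def parse_recipe(text):
    """Parse recipe text into a list of Steps, one per non-blank line."""
    steps = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines; '#' comments are a hypothetical convention.
        if not line or line.startswith("#"):
            continue
        name, *tokens = line.split()
        args, params = [], {}
        for token in tokens:
            if "=" in token:
                key, value = token.split("=", 1)
                params[key] = value
            else:
                args.append(token)
        steps.append(Step(name, args, params))
    return steps


RECIPE = """\
fasta_file soil-data.fa

qc_filter min_length=50 remove_Ns=true

graph_filter min_length=400

velvet_assemble k=33 min_length=1000 scaffolding=True
"""

steps = parse_recipe(RECIPE)
for step in steps:
    print(step.name, step.args, step.params)
```

Keeping the recipe as an ordered list of (name, args, params) means each step could be dispatched to whatever implementation is available, local or otherwise, without the recipe file itself changing.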

Then I specify machine parameters, e.g. 'bigmem.conf':

[defaults]
n_threads=8

[graph_filtering]
use_mem=32gb

[velvet]
needs_mem=64gb
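The machine config above happens to be valid INI, so Python's standard configparser can read it as-is. The sketch below is only one guess at how the pipeline might consume it: get_setting (fall back to the [defaults] section) and parse_mem (treat 'gb' as 2**30 bytes) are hypothetical helpers, and the binary-units interpretation is an assumption.

```python
import configparser

CONF = """\
[defaults]
n_threads=8

[graph_filtering]
use_mem=32gb

[velvet]
needs_mem=64gb
"""

# Assumption: sizes use binary units (1gb = 2**30 bytes).
UNITS = {"kb": 2**10, "mb": 2**20, "gb": 2**30, "tb": 2**40}


def parse_mem(value):
    """Convert a size such as '32gb' into a number of bytes."""
    value = value.strip().lower()
    number, unit = value[:-2], value[-2:]
    if unit not in UNITS:
        raise ValueError(f"unrecognized memory size: {value!r}")
    return int(number) * UNITS[unit]


def get_setting(config, section, key):
    """Look up key in section, falling back to the [defaults] section."""
    if config.has_option(section, key):
        return config.get(section, key)
    return config.get("defaults", key)


config = configparser.ConfigParser()
config.read_string(CONF)

# n_threads is not set under [velvet], so this comes from [defaults].
print(get_setting(config, "velvet", "n_threads"))
print(parse_mem(get_setting(config, "velvet", "needs_mem")))
```

Note that configparser only treats the upper-case section name DEFAULT specially, so a lower-case [defaults] section is an ordinary section and the fallback has to be done by hand, as above.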

And finally, I run the pipeline:

% ak-run metagenome.recipe -c bigmem.conf

If I have cloud access (and who doesn't?) I can tell the pipeline to spin up and down nodes as needed:

% ak-aws-run metagenome.recipe -c bigmem.conf

(Bear in mind most of these tasks are multi-hour, if not multi-day, operations, so I'm not too worried about optimizing machine use and re-use.)

Hadoop jobs could be spawned underneath, depending on how each recipe component was actually implemented.

As for testing reproducibility of pipeline results, which is the short-term motivation here, I can store results for regression testing with later versions:

% ak-run metagenome.recipe -c bigmem.conf --save-endpoint=/some/path

and then compare:

% ak-run --check-endpoint=/some/path
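One way the save/check pair might work underneath is to record a checksum for every output file and compare against it later. This is a rough sketch only: save_endpoint, check_endpoint, and the JSON manifest format are invented for illustration, and it assumes the pipeline's outputs are byte-for-byte reproducible, which may not hold for every tool (in which case a fuzzier comparison would be needed).

```python
import hashlib
import json
import tempfile
from pathlib import Path


def checksum_outputs(output_dir):
    """Map each file under output_dir (by relative path) to its SHA-256 digest."""
    output_dir = Path(output_dir)
    digests = {}
    for path in sorted(output_dir.rglob("*")):
        if path.is_file():
            rel = path.relative_to(output_dir).as_posix()
            # Reads the whole file at once; real outputs would want chunked reads.
            digests[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def save_endpoint(output_dir, endpoint_file):
    """Store the current output checksums as a JSON manifest."""
    manifest = checksum_outputs(output_dir)
    Path(endpoint_file).write_text(json.dumps(manifest, indent=2))


def check_endpoint(output_dir, endpoint_file):
    """Return the sorted list of files that are new, missing, or changed."""
    saved = json.loads(Path(endpoint_file).read_text())
    current = checksum_outputs(output_dir)
    return sorted(name for name in saved.keys() | current.keys()
                  if saved.get(name) != current.get(name))


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "out"
    out.mkdir()
    (out / "contigs.fa").write_text(">contig1\nACGT\n")
    endpoint = Path(tmp) / "endpoint.json"

    save_endpoint(out, endpoint)
    unchanged = check_endpoint(out, endpoint)   # nothing differs yet

    (out / "contigs.fa").write_text(">contig1\nACGA\n")
    changed = check_endpoint(out, endpoint)     # contigs.fa now differs

print(unchanged, changed)
```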

---

Now, does anyone know of a package or packages that actually do this, so I/we don't have to write it??

See texttest and ruffus for some of my inspiration/interest.

--titus

posted at: 06:56 | path: /mar-11 | 3 comments



Thu, 23 Dec 2010

What's in it for me? Thoughts on open science.


Unless you've been under a rock (or indulging in arsenic yourself), you've heard about NASA's "arsenic" article, claiming the discovery of a microbial species that can substitute arsenate for phosphate. The paper was pre-announced by way of a scheduled press conference, which then announced the results.

Immediate blogtastrophe! The paper was critically reviewed in the blogosphere by a lot of people; I'm particularly fond of Rosie Redfield's. Moreover, NASA has not covered itself in glory in its responses, claiming that blog reviews are not worth responding to, even when done by practicing scientists.

The Guardian has an article by Martin Robbins summing up much of the ensuing commentary, which boils down to some variation on "this paper should not have been published in Science", or "reviewer fail".

I found the Guardian article interesting, and I wanted to particularly comment on one of the concluding paragraphs:

At almost every stage of this story the actors involved were collapsing under the weight of their own slavish obedience to a fundamentally broken... well... 'system' is the right word, but I find myself toying with 'ideology'.

It's almost an article of faith online (blogs, Twitter, yada) that many venerable academic institutions, including peer review and the whole scientific publication model, are basically broken. I don't disagree, although I'm hardly an expert. But I do have a comment, or rather a question, that I think is pertinent to the discussion of how to improve science through blogging, online peer review, and other methods of openness.

What's in it for me?

More generally, why should Dr. Jane Random Researcher invest any time or effort into blogging (and responding to other bloggers), writing good software, or anything else? What good does it do her? Is online networking just another (still rather poor) social networking tool?

I don't think you can use the arsenic paper as an argument for peer review in the blogosphere. The only reason we noticed the arsenic paper was that it was a .1-percenter: fascinating results with significant implications, hyped to high heaven by NASA, and reviewed quite visibly by a number of serious scientists. If the paper hadn't been publicized as heavily, little or none of the online stuff would have mattered. As it is, I can virtually guarantee that the first author is not going to be able to ride the glory of a Science paper into a faculty position, because everyone knows about the controversy now. But that's this paper, not the other thousands that will be published this year (hundreds in Science alone).

This is the problem with the online world for scientists: there's no real systematized incentive to any of this online stuff. And that makes it really tough. I'm going through Reappointment right now (that's where you fill in a lot of little boxes tallying your papers, grants, teaching, and other stuff, so that your university can decide if you're worth keeping on for another few years). Nowhere on there is there a place for "influential blog posts" -- how would you measure that, anyway? Same with software -- I listed my various software releases on the "scientific products" page of the form, and have since been asked to describe and discuss the impact of my software. Since I don't track downloads, and half or more of the software hasn't been published yet and can't easily be cited, and people don't seem to reliably cite open source software anyway, I'm not sure how to document the impact.

So, why do I release software, or blog? Well, I do what little I do because I like it. Personally, I'm ideologically bent towards openness: open source, open science, open review, etc. And I'm willing to spend some of my time investing in it, writing about it, and otherwise trying to practice it. And I've managed to make it work for me reasonably well, at least so far; more on that in future blog posts.

Having an identifiable incentive structure, however, is important. If you want people in general to change, you need to be able to show them that there's some gain in it for them -- not monetary (no scientist I know is in it for the money) but economic in the academic sense. This boils down to cold hard grant cash & publications. Why? Because that's what the hiring, reappointment, promotion and tenure committees care about, so that's how you get and keep jobs -- and it's awfully hard to do research without a job.

The notion that publicizing your science leads to scientific fame and fortune is silly. The idea that additional citations are "a tangible benefit" is nonsense. The only "tangible benefit" that junior scientists care about is more time, more grants, and more publications. Writing blogs publicizing your own research is generally not going to help with that; rather, it's going to reduce the time you get to spend on doing science.

So how do you go from "online" to "grants and pubs"?

I don't know of any robust mechanisms for converting online reputation, from blogging or software release, into academic grants or publications. There are a few weak venues for software, like the NIH Software Maintenance grant program or the NSF Software Infrastructure for Sustained Innovation program, and some journals do exist to support the publication of software, but these haven't made much of an impact yet, and -- just as importantly -- seem to be largely uncoupled from software quality, at least as far as I can tell. Sean Eddy wrote a great article touching on the need for better software developer incentives that is particularly worth reading.

So why write software? Right now you only write software for the purpose of doing your own research, and there's very little incentive to make it public, much less make it good. It's rarely if ever peer reviewed, and the number of people using your software is (at best) useful for convincing grant reviewers that maybe you had some useful ideas a few years back.

In light of all of this, I'm very pleased to announce a new journal, Open Research Computation, or ORC. ORC is a journal for those of us who, for one reason or another, spend a lot of time working on the software. Cameron Neylon neyls it in his blog post:

Computation lies at the heart of all modern research. ... Open Research Computation is a journal that seeks to directly address the issues that computational researchers have. ... The primary consideration for publication in ORC is that your code must be capable of being used, re-purposed, understood, and efficiently built on.

I'm extra-specially-pleased to be on the board of editors, not least because so far it seems like this journal is trying to break significant new ground. Our ed board discussions so far have included discussions on how to properly "snapshot" version control repositories upon publication of the associated paper (easy for DVCS... not so much for svn) and considerations for "repeat" publishing of significant new software versions, as the software matures, in order to help encourage people to actually update and release their software.

This new journal isn't a panacea, of course. It's going to take 3-5 years, or even more, to make a real impact, if it ever does. But I'm enthusiastic about a venue that speaks to a major theme of my own scientific efforts -- responsible computing -- and that could help in the struggle to place responsible computing more squarely in the scientific focus.

I also hope that this kind of journal -- providing incentives for more online interaction, if only in software -- will help convince scientists that online interaction is a Good Thing. At the least it's one more brick in the road.

--titus

p.s. Merry Christmas, all!

posted at: 15:44 | path: /dec-10 | 1 comments



Thu, 09 Dec 2010

(Some) Principles of Computational Science


I'm just finishing up my Computational Science for Evolutionary Biologists course, and I'm finding it tricky to come up with a good high-level summary of what I would like them to take away. As you can see from the class notes, they've done some reasonably neat stuff with Digital Life and (separately!) next-gen sequence analysis, but the class has been somewhat random in its topics and train of thought.

Anyway, for the final class I decided I'd go slide by slide through a number of principles that they should apply if and when they find themselves doing computational science. In each case I can point to class exercises and homeworks that illustrate the points, which I think means I haven't totally failed... ;)

Here's what I have so far:


13 Principles of Computational Science:

  1. Computational science is just like any other science: don't trust it if you don't understand it.

Seriously. Computers aren't magic, and computational jargon isn't any more meaningful than any other jargon.

  2. The entire chain of evidence matters.

Keep close track of the raw data; the analysis source code; and the parameters used at each stage of data generation, processing and summarization.

Corollary: Make your raw data available. To do otherwise is just silly.

  3. If it's not automated, it's crrrrrap.

As soon as there's some manual step in your pipeline, you've lost track of what you're doing. You may do it differently, or not at all, or incorrectly. And you'll never know. You'll just get different results. Sometimes.

  4. Use version control.

If it's neither raw data (backed up!) nor generated data, put it in version control.

  5. Using other people's software to do science is hard.

They probably had some other use in mind that doesn't fit your needs, but you're going to try to adapt it anyway, aren't you? Good luck with that.

Corollary: using your own software to do science, 2 years after you wrote it, is hard -- because you're not you any more. (Remember, you can never step in the same stream twice.)

  6. No software is trustworthy.

Until you understand your software stack intuitively, have obsessed over parameter choices, and have locked down your software behavior with automated tests, don't trust it. After that, you can grudgingly extend some minimal trust to it, at least until the next version is released.

  7. Computation is not science.

Science is science. Computation may be one of the ways in which you do science.

  8. Hypotheses are good.

It's virtually impossible to analyze data without some kind of hypothesis in mind.

Corollary: Each hypothesis is only a starting point. It's probably wrong, so don't get too attached to it.

  9. More data is not necessarily less confusing.

The more data you have, the harder it can be to get a clean signal. Statistics help here, unless of course you have an unknown systematic bias in your data.

Corollary: You have an unknown systematic bias in your data.

  10. Interdisciplinary research is hard.

You need to be an expert in multiple fields, each with its own special techniques, lingo, and "commonly understood" shibboleths. Proper hypothesis testing involves mastering the first two; publication may depend on avoiding the last.

Corollary: computational science is implicitly interdisciplinary, hence hard. (If it were easy, we wouldn't need smart people like you to do it, right?)

  11. A lot of computing is just details.

There's very little magical about computing. An awful lot of it is just more details to remember: running software, gathering the results, processing them, plotting them, tweaking parameters, etc.

  12. Look at your data.

Look at your data, and your results, in as many ways as possible. You'll often be surprised by what's actually in there.

  13. Above all, tell a story.

Nobody is interested in just graphs. If you don't have an interesting story, dig deeper.
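
To make the "chain of evidence" principle a bit more concrete, here is a minimal sketch in Python of one way to record what went into an analysis: checksums of the raw data and the analysis script, plus the parameters used. This is only an illustration under assumptions, not anything from the class notes; the function names (sha256_of_file, build_manifest, write_manifest), the manifest layout, and the example parameter names are all hypothetical.

```python
import hashlib
import json
import sys
from pathlib import Path

# NOTE: all names and the manifest layout below are hypothetical,
# chosen for illustration only.


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fp:
        for chunk in iter(lambda: fp.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(raw_data_paths, script_path, parameters):
    """Collect checksums of raw data and the analysis script, plus parameters."""
    return {
        "raw_data": {str(p): sha256_of_file(p) for p in raw_data_paths},
        "analysis_script": {str(script_path): sha256_of_file(script_path)},
        "parameters": parameters,
        "python_version": sys.version.split()[0],
    }


def write_manifest(manifest, out_path):
    """Write the manifest as sorted, indented JSON so diffs stay readable."""
    text = json.dumps(manifest, indent=2, sort_keys=True) + "\n"
    Path(out_path).write_text(text)
```

Something like build_manifest(["reads.fa"], "analyze.py", {"kmer_size": 20}) could be called at the top of an automated pipeline and the result written next to the output files (the file and parameter names here are made up). Keeping that manifest under version control alongside the code would tie each set of results to the exact inputs that produced it.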

I know, somewhat scattered. Any more thoughts, or pointers to similar lists?

thanks,

--titus

p.s. I plan to finish up with my (IMO very underappreciated) principles of How to be a Successful Computational Scientist, summarized here:

  1. Never show them your data.
  2. Do not, under any circumstances, communicate clearly.
  3. Never release your source code, either.
  4. Judge computational science by results, not quality.
  5. Use as much data as possible.

Then they get to fill out evaluations. Whee!

posted at: 21:45 | path: /dec-10 | 2 comments
