Fri, 03 Sep 2010
Open Science, and Risk/Benefit Analysis
In thinking about open science and open communication about science, I've always been frustrated by the people who claim that the risks outweigh the benefits. Their arguments seem sound if you buy into a certain kind of logic (the creationists will try to twist whatever you say! the climate change deniers will use your words in ways you did not intend! people will steal your research! you cannot communicate openly about what you're doing!) but I could never pin down why they frustrated me. I had a eureka moment about it today, though.
When someone tells me that (for example) we should not make all BEACON research proposals fully public because they will be misinterpreted by creationists and used against us, they are saying this: in their personal opinion, the identified risks outweigh the identified benefits. They already know (and I agree that this will happen) that people will take the BEACON-funded study of -- for example -- some fascinating tailless ascidians as a scientific boondoggle, an excuse for a trip to France that won't result in anything but more incomprehensible literature about chordate origins. And they can't imagine that, without careful shaping of the message and management of the public image, this will not happen. Since there's no particularly obvious benefit to posting them publicly, the risks (of misinterpretation) outweigh the benefits (of some nebulous "open science" thingy). So halt! the publication.
Same arguments apply to climate change (but they'll just misuse/misinterpret the data!) and open science in general (but someone will just steal my data/ideas/...!)
This is fundamentally a failure of imagination. It is doing a risk analysis based on your worst fears, and neglecting a benefit analysis of your wildest hopes.
For example:
In the case of BEACON, we have a sprawling collection of 100 faculty spread across 5 institutions. I have literally no idea what more than half of them are doing. Wouldn't it be great if I could do a text search of their proposals, and even better if I could stumble across a BEACON colleague in a Google search on some topic or other? Or if we could attract students that didn't even know they were interested in "evolution in action", but came to our Web site based on Google's indexing of a rich array of research projects and then found themselves hooked?
What about the climate change skeptic (or agnostic) who suddenly gets a chance to sit down and look at all the data and can conclude that hey, this is actually really complicated? And it's probably not as simple as the skeptics claim? (Aside: I'm unbelievably pissed at the climate change community for the idiocy of their current closed-ness.)
And what about the collaborators that I could get (and am getting) from posting about some of our projects? In the worst case, I post about things and no one pays any attention; in the best case that I can think of, I make connections and establish cred that enables future collaborations, publications, and grant opportunities. (This is already happening.)
At the heart of science is an ethos that has to include openness in order to work properly. Any constriction in the flow of ideas and the interchange of opinions is a block in the very lifeblood of science itself. If we indulge those who argue against free communication, we are preventing not only some imagined negative consequences, but all of the happy coincidences that are beyond our limited imaginations.
So turn on, tune in, and don't drop out.
--titus
posted at: 20:55 | path: /sep-10 | 3 comments
Thu, 26 Aug 2010
Galileo, Open Science, and History
I'm a big believer in open science -- see this great polemic over at Mendeley for a good read -- but it's always interesting to think about how such things as "data release" can be perverted by clever scientists. I'm currently in France working on some ascidians with Billie Swalla -- more on that later -- and we've been talking about what data we plan to release, and how. During these talks (leisurely conducted over cafe au lait and chausson pomme, of course!) Billie brought up an interesting historical parallel.
The story, as I understand it, is this: when Galileo Galilei first looked through a good quality telescope and discovered Jupiter's moons, nobody believed him. Since he was the only person able to make such good telescopes, he actually made and distributed them to other scientists -- not just as a profitable sideline, but so that the other scientists could confirm his observations!
One could see this as a first step towards "open science": in order to reproduce Galileo's observations, astronomers had to have a telescope that only Galileo could make. So Galileo had to make telescopes and send them out, thus allowing others to both reproduce his observations and build upon them.
The story takes on a different aura, however, when you realize that Galileo could have just given out the actual manufacturing instructions for the telescopes, but didn't. Two possible reasons are money (he made money selling the telescopes to others) and scientific miserliness: he didn't want others to get credit for building on his results. As long as he withheld the details necessary to reproduce his instruments, he ensured that no one could build on his results, and that he would have preeminence in astronomy. (The parallels between this and source code are uncanny, no?)
It was quite a balancing act. To quote from Dr. Biagioli's "Replication or Monopoly" (pdf here),
"His primary worry was not that some people might reject his claims, but rather that those able to replicate them could too easily proceed to make further discoveries on their own and deprive him of future credit (Galilei 1989, 17). Consequently, he tried to slow down potential replicators to prevent them from becoming competitors. He did so by not providing other practitioners access to high-power telescopes and by withholding detailed information about how to build them.
But as important as it was for Galileo to keep his fellow astronomers in the dark, such negative tactics alone would not have allowed him to gain credit from his discoveries and move from his post at the university of Padua to a position at the Medici court in Florence as mathematician and philosopher of the grand duke - goals clearly on his mind in 1610. He needed proactive tactics as well. First, he did his best to make sure the grand duke saw the satellites of Jupiter (which Galileo had named "Medicean Stars") by sending detailed instructions to Florence on how to conduct these observations, and then by going to court himself at Easter time (Galilei 1890-1909, X:281, 304). Second, through the prompt publication of the Sidereus nuncius in March of 1610 he tried to establish priority and international visibility - resources he needed to impress his prospective patron, not just the republic of letters.
The Nuncius was carefully crafted to maximize the credit Galileo could expect from readers while minimizing the information given out to potential competitors."
Here you can see calculation as fine as any modern professor, trying to decide if they should release all their data, or only some of it; all of their source code, or only a crippled version.
Billie also observes that one potential irony in this story is that Galileo, by so strongly taking sole credit for his discoveries, made himself a clear target for the Catholic Church...
An even more pernicious approach, seeking priority while avoiding embarrassment by publishing hashes (well, anagrams ;) of formulae or observations, was common in the 17th century. In The Newton Handbook, by Derek Gjertsen, Gjertsen writes:
"It was not uncommon for seventeenth-century scientists to record their more valued results in the form of anagrams. Thus, Galileo published his discovery in 1610 of the phases of Venus in a thirty-five letter anagram, Huygens announced his 1656 observation that Saturn was surrounded by a ring in a sixty-three letter anagram, while, in England, Robert Hooke and Christopher Wren resorted to similar stratagems. The advantages of the ploy are obvious. Priority was established, yet nothing was given away to potential rivals. If, by chance, the work failed to stand up to further analysis it could be quietly forgotten without the embarrassment public failures tended to incur."
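For what it's worth, the modern equivalent of the anagram trick is a hash commitment: publish a digest of your claim now, reveal the claim later. Here is a minimal sketch in Python using the standard library's hashlib and secrets modules. The function names (commit, verify), the random nonce, and the sample claim text are illustrative assumptions, not anything taken from Gjertsen or from seventeenth-century practice.

```python
import hashlib
import secrets


def commit(claim: str) -> tuple[str, str]:
    """Return (digest, nonce). Publish the digest; keep the nonce and claim private."""
    # The random nonce stops rivals from guessing short claims by brute force.
    nonce = secrets.token_hex(16)
    digest = hashlib.sha256(f"{nonce}:{claim}".encode("utf-8")).hexdigest()
    return digest, nonce


def verify(digest: str, nonce: str, claim: str) -> bool:
    """Check that a revealed (nonce, claim) pair matches a previously published digest."""
    expected = hashlib.sha256(f"{nonce}:{claim}".encode("utf-8")).hexdigest()
    return secrets.compare_digest(expected, digest)


if __name__ == "__main__":
    # Hypothetical claim text, for illustration only.
    published, kept_secret = commit("Venus shows phases like the Moon")
    print("publish this:", published)
    print("later check:", verify(published, kept_secret, "Venus shows phases like the Moon"))
```

Like the anagram, this establishes priority while giving nothing away to rivals, and a commitment that doesn't pan out can be quietly forgotten, which is exactly the problem.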
One can only wonder how many one-shot awesome Science and Nature papers, using software that was and remains unavailable, are entirely unreplicable or otherwise uninteresting -- for example, I like to pick on one of Eran Segal's publications, because it's so neat and yet very very difficult to replicate without source code. (A colleague is trying.)
Compare this to the recent discussion of the (leaked) P != NP proof, now shown to be erroneous - see, e.g., Greg Baker's blog post, P != NP. Now this is the way science is supposed to work! Quick, thoughtful commentary by experts, highlighting potential problems with your work -- and allowing or enabling others to build off of it.
It's clear to see that by withholding the manufacturing instructions, Galileo may well have held back astronomy as a whole. And by publishing their equations in anagram form, it's likely that Newton and the others did damage to science as a whole.
Today, intellectual reputations like that are in some ways less important (at least in my bottom-feeding scientific world). Publications and citations are more important, since they're measurable by Promotion & Tenure committees. I (and probably many other scientists) am continually worrying about the line between publishing good stuff that enables citations, and giving away all of our future research directions. It takes a real act of faith to throw yourself off the cliff and offer up your latest & greatest source code and data to the world, in the hopes that somehow the resulting "usefulness" will provide lift to your career. We'll see how that goes: road kill? Or tenure?
Back to Galileo -- I think the Galileo example is why, as wonderful as the Panton Principles are for data, for truly open science it's critical to provide not only the raw data, but the source code used to do the analysis. And not only the source code, but useful source code: documented and tested source code [1]. To do anything else would be the equivalent of selling telescopes while withholding the manufacturing instructions that would let others build on your own ideas.
Interesting stuff to think about! Now, back to science...
--titus
[1] Yeah, I realize that most scientific source code probably isn't documented or tested. Draw your own conclusions there ;).
posted at: 14:07 | path: /aug-10 | 3 comments
Wed, 19 May 2010
The grim future for sequencing centers
In conversation with a colleague the other day, I found myself making a surprising prediction: the age of the big sequencing centers (Broad Institute, WUSTL, Baylor, DOE JGI, etc.) is coming to an end. In 5 years they will no longer exist.
This prediction is obvious in hindsight.
That is all.
Hah! No, seriously, I've had a number of interactions with sequencing centers over the last decade, and I feel that many of them are failing to make the transition from hugely-funded centers containing lots of cloning expertise and bajillions of ABI Sanger sequencing machines, to centers of genome expertise and analysis. The new reality of holy-cow-everyone-can-sequence-whatever-they-want, brought on by Roche 454, Illumina GA, ABI SOLiD, and soon Pacific Biosciences, is driving this. It is now possible to sequence entire animal genomes in private facilities funded by single-investigator grants, which replaces the primary raison d'etre of big sequencing centers... so what next?
The new challenge of sequencing is in assembly and analysis of the data, and I think everyone is just overwhelmed here. Certainly when I talk to people at sequencing centers, they are capable of generating far more new sequence than they are of assembling or analyzing the sequence they generated last week. For example, the latest lamprey genome assembly was done by Jeramiah Smith in Chris Amemiya's lab, not by WUSTL; and the basic gene set is being constructed by Carson Holt in Mark Yandell's lab in Utah, not by ENSEMBL. The wait time to get into the assembly and analysis queues, and the iteration time needed to integrate new mRNAseq data into the gene set, is simply too great at the centers. Analysis of a large soil metagenomics project (200 gb and counting!) in collaboration with the JGI is running into machine access issues: none of us have quick access to machines capable of running the analyses quickly, although I appear to be the closest because of the MSU HPC.
Contrast this situation with other examples, such as my recent trip to Mississippi State, where I had a great conversation with a graduate student who is assembling a brown mold genome, all on her own, on a lab machine, with no prior computational experience. Or some friends at Caltech, who have sequenced, assembled, and analyzed both the genome and transcriptome of a worm -- all on their own, with no center involvement. I mean, these people are all ridiculously smart and competent, but I think there are a lot of such people in academia. They just needed cheap sequencing to challenge them!
I wish I could blame the centers for lack of vision or something, but honestly I think they're just the biggest targets for everyone at the moment. People are used to the "mainframe model" of sequencing, where you go to the sequencing center with your genome in hand and beg the high poobahs to sequence, assemble, and annotate it for you; but their funding for computer power and analysis hasn't kept up with the sequencing bonanza (nor could it have), so now they are simply the most visible people failing to keep up with analysis. Unfair but whatcha gonna do?
Are there centers that are keeping up? It's hard for me to say, since I'm not in the rarefied bajillion-dollar-PI meetings (note: I'm available for such meetings, folks; I bring 20 years of computational experience, a corresponding deep cynicism, and 10 years of bioinformatics to the table, plus a taste for expensive scotch. Reserve me today!). But I note that the Beijing Genomics Institute has a distressing habit of publishing "firsts", including the short-read Panda genome paper and a Human Microbiome Project. I have concerns about their long-term viability but that will have to wait for another blog post...
OK, so what's the future, mr. smarty pants? Damned if I know. Paul Sternberg has a great quote that is my touchstone, though: the biggest, most exciting advances come from the sharpshooter on the hill rather than the army toiling across the plain. I've never been excited by large collaborations, which tend to get embroiled in management issues and politics; while there are some places (like HPCs) where centralization is good, lots of individual investigators are much more likely to generate the diversity of approaches that I think we need.
And did I mention training? Whoops, so silly of me to forget that.
Regardless, I think we're in for a wild and wooly ride on the next-gen sequencing train, and the next few years should be incredibly exciting. It's great to be a (computational) biologist!
--titus
posted at: 08:23 | path: /may-10 | 2 comments
Mon, 17 May 2010
My Data Management Plan - a satire
Dear NSF,
I am happy to respond to your request for a 2-page Data Management Plan.
First of all, let me say how enthusiastic I am that you have embraced this new field of "large scale data analysis". Ever since I started working with large Avida data sets in 1993, then with large meteorological data sets in 1995, and then again with large sequence data sets in 1999, I have seen the need for a systematic plan to manage the data. It is nice to see NSF stepping up to the plate in such a timely manner, and I am happy to comply.
Now, as to my actual data management plan, here is how I plan to deal with research data in the future.
I will store all data on at least one, and possibly up to 50, hard drives in my lab. The directory structure will be custom, not self-explanatory, and in no way documented or described. Students working with the data will be encouraged to make their own copies and modify them as they please, in order to ensure that no one can ever figure out what the actual real raw data is.
Backups will rarely, if ever, be done.
When required to make the data available by my program manager, my collaborators, and ultimately by law, I will grudgingly do so by placing the raw data on an FTP site, named with UUIDs like 4e283d36-61c4-11df-9a26-edddf420622d. I will under no circumstances make any attempt to provide analysis source code, documentation for formats, or any metadata with the raw data. When requested (and ONLY when requested), I will provide an Excel spreadsheet linking the names to data sets with published results. This spreadsheet will likely be wrong -- but since no one will be able to analyze the data, that won't matter.
Did I mention the click-through license? "You are provided this data for the sole purpose of reproducing our published results. Any attempt to publish your own analyses of this data will be rejected, if necessary during the anonymous review process, by pointing out all of the data cleanup steps you forgot to do correctly in your analysis. (We don't remember all of them ourselves, but there sure were a lot!) Give up now."
We will provide a short note -- in a Word document -- detailing the licensing restrictions, as above.
We will make sure that any CSV files we do eventually produce will have format errors, such as missing or extra commas. They will also be encoded in ISO 8859-16, "by accident".
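(If you're on the receiving end of files like these, a quick sanity check is at least possible. Below is a minimal sketch, assuming the file really is ISO 8859-16 and that the first row is the header. find_ragged_rows is a hypothetical helper name; Python's standard csv module and its iso8859_16 codec do the actual work.)

```python
import csv
import io


def find_ragged_rows(raw: bytes, encoding: str = "iso8859_16") -> list[tuple[int, int]]:
    """Return (line_number, field_count) for rows whose width differs from the header's.

    Assumes the first row is the header. Blank lines are skipped rather than reported.
    """
    text = raw.decode(encoding)
    reader = csv.reader(io.StringIO(text, newline=""))
    header = next(reader, None)
    if header is None:
        return []  # empty file: nothing to compare against
    expected = len(header)
    ragged = []
    for row in reader:
        if row and len(row) != expected:
            # reader.line_num is the number of source lines read so far
            ragged.append((reader.line_num, len(row)))
    return ragged


if __name__ == "__main__":
    # Made-up sample data: one row with an extra comma, one with a missing comma.
    sample = "name,value\nș,1\nb,2,3\nc\n".encode("iso8859_16")
    print(find_ragged_rows(sample))
```

A row count mismatch only tells you that something is wrong with the commas, not which field was intended, so the spreadsheet's author still has to be asked.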
On the off chance that we do choose to provide the source code, it will be in a file named 'source.tar.gz' that unpacks into the current directory. There will be no explanation of contents, instructions on how to run it, or any enabling information -- it was hard to write, and it should be hard to run! Old, patched, or otherwise impossible-to-obtain versions of Red Hat Linux, Perl 5, and associated CPAN libraries will be required before the code runs, even if it doesn't actually use any of them. No source code documentation will be present, of course -- we don't need it ourselves, after all! Automated tests will also not be present (we don't have any of those, either). New versions of the code will be published under the identical file name, with no indication of what changes were made. (We'll be sure to use mixed DOS and Unix EOL editors for our files, so 'diff' won't work to figure out what has changed.)
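(A recipient can at least peek before unpacking. Here is a rough sketch using Python's standard tarfile module. top_level_entries is a hypothetical helper name, and the archive name in the usage example is the one from the plan above.)

```python
import tarfile
from pathlib import PurePosixPath
from typing import BinaryIO


def top_level_entries(archive: BinaryIO) -> set[str]:
    """Return the distinct top-level names inside a (possibly compressed) tar archive.

    More than one name means the archive will scatter files into whatever
    directory it is unpacked in.
    """
    with tarfile.open(fileobj=archive, mode="r:*") as tar:
        entries = set()
        for name in tar.getnames():
            parts = PurePosixPath(name).parts
            if parts:  # skip a bare "." entry, which has no parts
                entries.add(parts[0])
        return entries


if __name__ == "__main__":
    with open("source.tar.gz", "rb") as fh:
        names = top_level_entries(fh)
    if len(names) == 1:
        print("unpacks into a single top-level entry:", names)
    else:
        print("unpacks into the current directory:", sorted(names))
```

If the set has more than one entry, unpacking in a fresh, empty directory keeps the mess contained.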
Note, we didn't use a version control system, either. Or if we did, we made sure to use svn branching and merging profligately, with extremely obscure commit messages (our main programmer only speaks Chinese, so that's how she enters her commit notes. Wouldn't have it any other way). And our repository is not publicly available - you have to beg for permission. Note, I only answer e-mail on every other Tuesday.
Any design notes on the data analysis are in our private e-mail, and we will fight to the death -- up to and including ignoring FOIA requests -- to prevent you from obtaining them.
Meanwhile we will continue publishing exciting-sounding (but irreproducible) analyses, and submitting grants based on them, because that's the only thing that the reviewers care about.
sincerely yours,
--titus
(representing every computational scientist in the world.)
posted at: 08:18 | path: /may-10 | 6 comments
Mon, 22 Feb 2010
BEACON funded: $25m / 5 years == awesome
The National Science Foundation just announced that the BEACON Science and Technology Center, centered at Michigan State University, has been funded. BEACON stands for "Bio/computational Evolution in Action Consortium" - you can check out the Web site here.
In my own nutshell, BEACON is focused on studying the evolution of organization across multiple scales -- from genomic and cellular, to multicellular, to inter-multicellular (a.k.a. social) -- using techniques from experimental evolution, modeling, and digital life systems.
BEACON is a project nucleated by a long-time collaboration between the Lenski Experimental Evolution Lab and the Devolab, parts of which grew out of a summer undergraduate research project (Avida!) that Charles Ofria and I did under Chris Adami's supervision in 1993.
I feel old.
The practical consequences are pretty cool.
First, it means that MSU (and our partner institutions, too -- see below) has money explicitly for supporting students doing really sexy interdisciplinary work combining computation and biology. This is the kind of work that has been reasonably hard to find funding for, especially as it gets less and less connected to, ahem, reality. So we're looking for really awesome students that don't fit in a nice, neat academic box. (How often do you hear that?? ;)
Don't like Michigan? Well, that's fine -- BEACON is a collaboration between MSU, U. Idaho, UT Austin, UW Seattle, and North Carolina A&T. Drop me a line and I can put you in touch with PIs at your favorite graduate school.
It also means that I am being recruited to teach a course on bringing biologists to computational science. This should have positive effects on the state of the Software Carpentry notes, for one. It also means more biologists being brought into the light of Python, for another. Good? I think so ;)
Finally, it means I will probably be thinking about an even wider range of research and research activities in my lab. If you're thinking about starting grad school in 2011, check out BEACON in general and my lab in particular -- I'm interested in
- evolution of gene regulation in artificial systems
- understanding evolutionary signals of information gain in genomes
- evolution of vertebrate complexity
and more.
--titus
posted at: 08:34 | path: /feb-10 | 1 comments