Fri, 21 May 2010
Help! Help! Class notes site?
So, I'm running this summer course and I am trying to figure out how to organize the notes for students. I'd like to mix curriculum-specific notes ("here's what we're doing today, and here are some problems to work on") with tutorials (material independent of a single course, like "here's how to transfer files between computers" or "here's how to parse CSV files"), and allow students to search the documents, annotate them in their Web browser, search the annotations, and perhaps even do public or private bookmarking and tagging. The ability to edit the primary content in something other than a Web GUI would be really, really nice, too -- that way I can write in something like ReST and then upload into the system.
(This is a system I could write myself, but that's kind of silly, dontcha think?)
It should also be lightweight, reasonably mature, easy to set up, and (preferably) written in Python, although I'm willing to compromise on the last simply because I'm desperate.
Pointers, comments, suggestions welcome!
--titus
posted at: 08:22 | path: /may-10 | 7 comments
Wed, 19 May 2010
The grim future for sequencing centers
In conversation with a colleague the other day, I found myself making a surprising prediction: the age of the big sequencing centers (Broad Institute, WUSTL, Baylor, DOE JGI, etc.) is coming to an end. In 5 years they will no longer exist.
This prediction is obvious in hindsight.
That is all.
Hah! No, seriously, I've had a number of interactions with sequencing centers over the last decade, and I feel that many of them are failing to make the transition from hugely-funded centers containing lots of cloning expertise and bajillions of ABI Sanger sequencing machines, to centers of genome expertise and analysis. The new reality of holy-cow-everyone-can-sequence-whatever-they-want, brought on by Roche 454, Illumina GA, ABI SOLiD, and soon Pacific Biosciences, is driving this. It is now possible to sequence entire animal genomes in private facilities funded by single-investigator grants, which replaces the primary raison d'etre of big sequencing centers... so what next?
The new challenge of sequencing is in assembly and analysis of the data, and I think everyone is just overwhelmed here. Certainly when I talk to people at sequencing centers, they are capable of generating far more new sequence than they are of assembling or analyzing the sequence they generated last week. For example, the latest lamprey genome assembly was done by Jeramiah Smith in Chris Amemiya's lab, not by WUSTL; and the basic gene set is being constructed by Carson Holt in Mark Yandell's lab in Utah, not by ENSEMBL. The wait time to get into the assembly and analysis queues, and the iteration time needed to integrate new mRNAseq data into the gene set, are simply too great at the centers. Analysis of a large soil metagenomics project (200 gb and counting!) in collaboration with the JGI is running into machine access issues: none of us have quick access to machines capable of running the analyses quickly, although I appear to be the closest because of the MSU HPC.
Contrast this situation with other examples, such as my recent trip to Mississippi State, where I had a great conversation with a graduate student who is assembling a brown mold genome, all on her own, on a lab machine, with no prior computational experience. Or some friends at Caltech, who have sequenced, assembled, and analyzed both the genome and transcriptome of a worm -- all on their own, with no center involvement. I mean, these people are all ridiculously smart and competent, but I think there are a lot of such people in academia. They just needed cheap sequencing to challenge them!
I wish I could blame the centers for lack of vision or something, but honestly I think they're just the biggest targets for everyone at the moment. People are used to the "mainframe model" of sequencing, where you go to the sequencing center with your genome in hand and beg the high poobahs to sequence, assemble, and annotate it for you; but their funding for computer power and analysis hasn't kept up with the sequencing bonanza (nor could it have), so now they are simply the most visible people failing to keep up with analysis. Unfair but whatcha gonna do?
Are there centers that are keeping up? It's hard for me to say, since I'm not in the rarefied bajillion-dollar-PI meetings (note: I'm available for such meetings, folks; I bring 20 years of computational experience, a corresponding deep cynicism, and 10 years of bioinformatics to the table, plus a taste for expensive scotch. Reserve me today!). But I note that the Beijing Genomics Institute has a distressing habit of publishing "firsts", including the short-read Panda genome paper and a Human Microbiome Project. I have concerns about their long-term viability but that will have to wait for another blog post...
OK, so what's the future, mr. smarty pants? Damned if I know. Paul Sternberg has a great quote that is my touchstone, though: the biggest, most exciting advances come from the sharpshooter on the hill rather than the army toiling across the plain. I've never been excited by large collaborations, which tend to get embroiled in management issues and politics; while there are some places (like HPCs) where centralization is good, lots of individual investigators are much more likely to generate the diversity of approaches that I think we need.
And did I mention training? Whoops, so silly of me to forget that.
Regardless, I think we're in for a wild and wooly ride on the next-gen sequencing train, and the next few years should be incredibly exciting. It's great to be a (computational) biologist!
--titus
posted at: 08:23 | path: /may-10 | 2 comments
Mon, 17 May 2010
My Data Management Plan - a satire
Dear NSF,
I am happy to respond to your request for a 2-page Data Management Plan.
First of all, let me say how enthusiastic I am that you have embraced this new field of "large scale data analysis". Ever since I started working with large Avida data sets in 1993, then with large meteorological data sets in 1995, and then again with large sequence data sets in 1999, I have seen the need for a systematic plan to manage the data. It is nice to see NSF stepping up to the plate in such a timely manner, and I am happy to comply.
Now, as to my actual data management plan, here is how I plan to deal with research data in the future.
I will store all data on at least one, and possibly up to 50, hard drives in my lab. The directory structure will be custom, not self-explanatory, and in no way documented or described. Students working with the data will be encouraged to make their own copies and modify them as they please, in order to ensure that no one can ever figure out what the actual real raw data is.
Backups will rarely, if ever, be done.
When required to make the data available by my program manager, my collaborators, and ultimately by law, I will grudgingly do so by placing the raw data on an FTP site, named with UUIDs like 4e283d36-61c4-11df-9a26-edddf420622d. I will under no circumstances make any attempt to provide analysis source code, documentation for formats, or any metadata with the raw data. When requested (and ONLY when requested), I will provide an Excel spreadsheet linking the names to data sets with published results. This spreadsheet will likely be wrong -- but since no one will be able to analyze the data, that won't matter.
Did I mention the click-through license? "You are provided this data for the sole purpose of reproducing our published results. Any attempt to publish your own analyses of this data will be rejected, if necessary during the anonymous review process, by pointing out all of the data cleanup steps you forgot to do correctly in your analysis. (We don't remember all of them ourselves, but there sure were a lot!) Give up now."
We will provide a short note -- in a Word document -- detailing the licensing restrictions, as above.
We will make sure that any CSV files we do eventually produce will have format errors, such as missing or extra commas. They will also be encoded in ISO 8859-16, "by accident".
On the off chance that we do choose to provide the source code, it will be in a file named 'source.tar.gz' that unpacks into the current directory. There will be no explanation of contents, instructions on how to run it, or any enabling information -- it was hard to write, and it should be hard to run! Old, patched, or otherwise impossible-to-obtain versions of Red Hat Linux, Perl 5, and associated CPAN libraries will be required before the code runs, even if it doesn't actually use any of them. No source code documentation will be present, of course -- we don't need it ourselves, after all! Automated tests will also not be present (we don't have any of those, either). New versions of the code will be published under the identical file name, with no indication of what changes were made. (We'll be sure to use mixed DOS and Unix EOL editors for our files, so 'diff' won't work to figure out what has changed.)
Note, we didn't use a version control system, either. Or if we did, we made sure to use svn branching and merging profligately, with extremely obscure commit messages (our main programmer only speaks Chinese, so that's how she enters her commit notes. Wouldn't have it any other way). And our repository is not publicly available - you have to beg for permission. Note, I only answer e-mail on every other Tuesday.
Any design notes on the data analysis are in our private e-mail, and we will fight to the death -- up to and including ignoring FOIA requests -- to prevent you from obtaining them.
Meanwhile we will continue publishing exciting-sounding (but irreproducible) analyses, and submitting grants based on them, because that's the only thing that the reviewers care about.
sincerely yours,
--titus
(representing every computational scientist in the world.)
posted at: 08:18 | path: /may-10 | 6 comments
Sun, 02 May 2010
The "Avida three" ride again.
Just got news that the BEACON NSF Science and Technology Center for the study of Evolution in Action funded Chris Adami to come do a sabbatical here at Michigan State University for the next year. This puts me, Chris, and Charles Ofria at the same institution (now MSU, then Caltech) for the first time since 1993, other than a brief overlap in ~2000. 1993 was when we designed and implemented Avida. (Avida must be one of the most long-lived summer research projects ever...)
Should I be upset that Charles and Chris both have Wikipedia pages, and Charles is an Associate Professor and Chris a full Professor, while I am only a lowly Assistant Prof, even though I'm only a year younger than Charles? Naah -- I don't have any grey hair yet. I'll take the trade... ;)
The "family" has also grown. We are all married, 2 of us have kids, and we all have a bunch of students and postdocs, too. It's gettin' crowded around here: should be fun!
--titus
posted at: 20:50 | path: /may-10/may-10 | 0 comments