Fri, 08 Jul 2011
Course: Analyzing Next-Generation Sequencing Data (2011 version)
The second iteration of our bioinformatics summer course, Analyzing Next-Generation Sequencing Data, just finished. It was a great success, at least judging from the comments that people made to us personally; the evaluations aren't yet complete.
The what: a two week course on analyzing next-gen sequencing data, using the Amazon cloud, for biologists with no prior computational experience.
The who: 2 instructors (myself and Ian Dworkin), 4 TAs (Jason, Likit, Rose, and Chris), and 2 visiting faculty (Istvan Albert and Erich Schwarz). 24 students, from 5 of the 7 continents (Australia, South America, Europe, Asia, and North America).
The where: two weeks out at the lush Kellogg Biological Station, a research station in west mid-Michigan (just north of Kalamazoo).
As with last year's course, we taught students to use Amazon AWS for all their computing needs, including compute and data storage. We had an education grant from Amazon this year, which let us play with more and niftier services: I used less (well, no) S3 and more/bigger EC2 machines with EBS volumes and snapshots. This worked fantastically well, with the main problem being the lousy connection between our room and the rest of the campus, which led to occasional bottlenecks as 24 students tried to bring up their machines. I don't know how many people are taking even partial advantage of AWS for their classes, but the sheer awesomeness of the cloud for teaching computer science and scientific computation is something that needs to be broadcast.
The students this year were more motivated, if possible, than last year's. If I had more than two points on the graph, I'd suggest that there was a trend here: everyone is getting this data, and nobody has the least #!%!@^@#% idea what to do with it. (Oh, heck, it's an obvious trend ;). One fallout from this is that we challenged the students with more scripting, and were rewarded -- most people seemed to think it was a good intro, although we'll have to see what happens down the road with that. (Also see this great Nat. Methods technology feature, replete with quotes from yours truly, about programming in biology.)
The students this year were also more diverse. Last year we had 33 applicants for 24 slots, and I accepted 24 and 23 came. This year we had 133 applicants (!!) for 24 slots, and all 24 were taken (although I had to fill a slot left by one student who simply didn't show up. Bad person. No cookie for you.). This led to more diversity in backgrounds, skill sets, and training. We even let a bioinformatician slip into the course by accident ;')
Very interestingly, the range of critters under study was more diverse. This was partly due to selection criteria -- Ian and I did most of the selecting, and as we are both evolution-curious, we tried to pick a diverse group of critters. Notables included: two cnidarian researchers, one dinoflagellate researcher (their critter had a freakin' 64 Gb genome... that's 21 times human!), a last-minute lophotrochozoan person, someone working with 20-odd microbial genomes, a mouse/human xenografter, someone studying a symbiont of a symbiont, and someone working on strange, edible sea critters, as well as many others that I'm forgetting about. I think another trend that is probably completely obvious to those of us working in evolution studies is that literally everything on this planet is getting blended up and thrown into sequencers -- and there ain't no reference genomes for these critters...
As with last year, we tried to pass on the message that this was a very immature and fast moving field, and you had to worry a lot about the quality of your sequence, on top of all the other issues. Interestingly, this was a strong message from Istvan Albert's talk, too; I know that Ian and I both tend to be very skeptical of next-gen sequence quality, but people are sometimes skeptical of the strength of our skepticism, so it was nice to hear the same message from someone completely different!
Fewer of my hair follicles were lost this year due to Ian's participation as co-instructor as well as the awesome guest lecturers. That was nice. Sleep was still hard to find, but I had more fun and fewer moments of sheer terror.
All of our tutorials are copy-paste and freely available under a Creative Commons/Attribution license. Please use and abuse! There are tutorials on basic mapping and assembly, transcriptome analysis, ChIP-seq, and resequencing analysis available, among others.
One piece of good news for students who want to take this course: it looks like the 2012 course will be happening! If it does, it will run around June 4 - June 15. Same bat place. I will announce it here and everywhere & link to it from the old course page.
A few other courses have sprung up. The three that I think look the best are: the UC Davis course, in September; the NESCent course; and the Ft. Collins workshop on Comparative Genomics, happening in July (next week!). CSHL is running a few different courses (programming for biologists, advanced sequencing technologies, and computational/comparative genomics). The Davis course is the only one that uses the cloud much, which is a shame; the cloud usage is something that a lot of people seem to like about our course, and it's really convenient for people from institutions without significant compute resources. (Note that I know the NESCent course, at least, was just as oversubscribed as ours. There's room out there for more courses!)
BTW, we'd love to find someone interested in writing up our course (and maybe these others?) for a news piece in some journal. Drop me a note at ctb@msu.edu if you are interested!
--titus
posted at: 08:58 | path: /jun-11 | 0 comments
Mon, 14 Jun 2010
Teaching next-gen sequencing data analysis to biologists
Our sequencing analysis course ended last Friday, with an overwhelmingly positive response from the students. The few negative comments that I got were largely about organizational issues, and could be reshaped as suggestions for next time rather than as condemnations of this year's course.
The 23 students -- most with no prior command-line experience -- spent two weeks experiencing at first hand the challenges of dealing with dozens of gigabytes of sequencing data. Each of the students went through genome-scale mapping, genome assembly, mRNAseq analysis on an "emerging model organism" (a.k.a. "one with a crappy genome", lamprey), resequencing analysis on E. coli, and ChIP-seq analysis on Myxococcus xanthus. By the beginning of the second week, many students were working with their own data -- a real victory. Python programming competency may take a bit longer, but many of them seem motivated.
If you had told me three weeks ago that we could pull this off, I would have told you that you were crazy. This does raise the question of what I was thinking when I proposed the course -- but don't dwell on that, please...
The locale was great, as you can see:
One of the most important lessons of the course for me is that cloud computing works well to backstop this kind of course. I was very worried about the suitability and reliability and ease of use, but AWS did a great job, providing an easy-to-use Web interface and a good range of machine images. I have little doubt that this course would have been nearly impossible (and either completely ineffective or much more expensive) without it.
In the end, we spent more on beer than on computational power. That says something important to me :)
The course notes are available under a CC license although they need to be reworked to use publicly available data sets before they become truly useful. At that point I expect them to become awesomely useful, though.
From the scientific perspective, the students derived a number of significant benefits from the course. One that I had not really expected was that some students had no idea what went into computational "sausage", and were kind of shocked to see what kinds of assumptions we comp bio people made on their behalf. This was especially true in the case of students from companies, who have pipelines that are run on their data. One student lamented that "we used to look at the raw traces... now all we get are spreadsheet summaries!" Another student came to me in a panic because they didn't realize that there was no one true answer -- that that was in fact part of the "fun" of all biology, not just experimental biology. These reactions alone made teaching the course worthwhile.
Of course, the main point is that many of the students seem to be capable of at least starting their own analyses now. I was surprised at the practical power of our cut-and-paste approach -- for example, if you look at the Short-read assembly with ABySS tutorial, it turns out to be relatively straightforward to adapt this to doing assemblies of your own genomic or transcriptomic data. I based our approach on Greg Wilson's post on the failure of inquiry-based teaching and so far I like it.
I am particularly amused that we have now documented, in replicable detail, the Kroos Lab MrpC ChIP analysis. We also have the best documentation for Jeff Barrick's breseq software, I think; this is what is used to analyze the Long Term Evolution Experiment lines -- and I can't wait for the anti-evolutionists to pounce on that... "Titus Brown -- making evolution experiments accessible to creationists." Yay?
There were a number of problems and mistakes that we had to steamroller through. In particular, more background and more advanced tutorials would have been great, but we just didn't have time to write them. Some 454, Helicos, and SOLiD data sets (and next year, PacBio?) would be a good addition. We had a general lack of multiplexing data, which is becoming a Big Thing now that sequencing is so ridiculously deep. I would also like to introduce additional real data analyses next year, reprising things like the Cufflinks analysis and whole-vertebrate-genome ChIP-seq/mRNAseq a la the Wold Lab. I'm weighing adding metagenomics data analysis in for a day, although it's a pretty separate field of inquiry (and frankly much harder in terms of "unknown unknowns"). We also desperately need some plant genomics expertise, because frankly I know nothing about plant genomes; my last-minute plant genomics TA fell through due to lack of planning on my part. (Conveniently, plant genomics is something MSU is particularly good at, so I'm sure I can find someone next year.)
Oops, did I say next year? Well, yes. If I can find funding for my princely salary, then I will almost certainly run the course again next year. I can cover TAs and my own room/board and speakers with workshop fees, but if I'm going to keep room+board+fees under $1000/student -- a practical necessity for most -- there's no way I can pay myself, too. And while this year I relied on my lovely, patient, and frankly long-suffering wife to hold down the home fort while I was away for two weeks, I simply can't put her through that again, so I will need to pay for a nanny next year. So doing it for free is not an option.
In other words, if you are a sequencing company, or an NIH/NSF/USDA program director, interested in keeping this going, please get in touch. I plan to apply for this Initiative to Maximize Research Education in Genomics in September, but I am not confident of getting that on the first try, and in any case I will need letters of support from interested folks. So drop me a note at ctb@msu.edu.
Course development this year was sponsored by the MSU Gene Expression in Disease and Development, to whom I am truly grateful. The course would simply not have been possible without their support.
My overall conclusion is that it is possible to teach bench biologists with no prior computational experience to achieve at least minimal competency in real-world data analysis of next-generation sequencing data. I can't conclusively demonstrate this without doing a better job of course evaluation, and of course only time will tell if it sticks for any of the students, but right now I'm feeling pretty good about the course overall. Not to mention massively relieved.
--titus
p.s. Update from one student -- "It's not even 12 o'clock Monday morning and I'm already getting people asking me how to run assemblies and analyze data." Heh.
posted at: 08:38 | path: /jun-10 | 0 comments
Tue, 08 Jun 2010
Running a next-gen sequence analysis course using Amazon Web Services
So, I've been teaching a course on next-generation sequence analysis for the last week, and one of the issues I had to deal with before I proposed the course was how to deal with the volume of data and the required computation.
You see, next-generation sequence analysis involves analyzing not just entire genomes (which are, after all, only 3 Gb or so in size) but data sets that are 100x or 1000x as big! We want to not just map these data sets (which is CPU-intensive), but also perform memory-intensive steps like assembly. If you have a class with 20+ students in it, you need to worry about a lot of things:
- computational power: how do you provide 24 "good" workstations?
- memory
- disk space
- bandwidth
- "take home" ability
One strategy would be to simply provide some Linux or Mac workstations, with cut-down data sets. But then you wouldn't be teaching reality -- you'd be teaching a cut-down version of reality. This would make the course particularly irrelevant given that one of the extra-fun things about next-gen sequence analysis is how hard it is to deal with the volume of data. You also have to worry that the course would be made even more irrelevant because the students would leave the course and be unable to use the information without finding infrastructure and installing a bunch of software and then administering the machine.
While I enjoy setting up computers and installing software and managing users, I'm clearly masochistic. It's also entirely beside the point for bioinformaticians and biologists - they just want to analyze data!
The solution I came up with was to use Amazon Web Services and rent some EC2 machines. There's a large variety of hardware configurations available (see instance types) and they're not that expensive per hour (see pricing).
This has worked out really, really well.
It's hard to enumerate the benefits, because there have been so many of them ;). A few of the obvious ones --
We've been able to write tutorials (temporary home here: http://ged.msu.edu/angus/) that make use of specific images and should be as future-proof as they can be. We've given students cut and paste command lines that Just Work, and that they can tweak and modify as they want. If it borks, they can always just throw it away and start from a clean install.
It's dirt cheap. We spent less than $50 the first week, for ~30 people using an average of 8 hours of CPU time. The second week will increase to an average of 8 hours of CPU time a day, and for larger instances -- so probably about $300 total, or maybe even $500 -- but that's ridiculously cheap, frankly, when you consider that there are no hardware issues or OS re-install problems to deal with!
Students can choose whatever machine specs they need in order to do their analysis. More memory? Easy. Faster CPU needed? No problem.
All of the data analysis takes place off-site. As long as we can provide the data sets somewhere else (I've been using S3, of course) the students don't need to transfer multi-gigabyte files around.
The students can go home, rent EC2 machines, and do their own analyses -- without their labs buying any required infrastructure.
Home institution computer admins can use the EC2 tutorials as documentation to figure out what needs to be installed (and potentially, maintained) in order for their researchers to do next-gen sequence analysis.
The documentation should even serve as a general set of tutorials, once I go through and remove the dependence on private data sets! There won't be any need for students to do difficult or tricky configurations on their home machines in order to make use of the tutorial info.
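For the curious, the cost figures above can be turned into a rough per-hour number. This is only a back-of-the-envelope sketch: it assumes the "8 hours of CPU time" in the first week means 8 instance-hours per person for the whole week, and that the second week has five class days. Neither is spelled out above, and the function names are made up for illustration.

```python
# Back-of-the-envelope check of the cost figures quoted in this post.
# ASSUMPTION: "an average of 8 hours of CPU time" in week one means
# 8 instance-hours per person over the whole week.
WEEK1_SPEND_USD = 50.0       # "less than $50", so treat this as an upper bound
PEOPLE = 30                  # "~30 people"
WEEK1_HOURS_PER_PERSON = 8   # assumed per-person total for week one


def implied_hourly_rate(spend_usd, people, hours_per_person):
    """Average cost per instance-hour implied by a total spend."""
    return spend_usd / (people * hours_per_person)


def projected_instance_hours(people, hours_per_day, class_days):
    """Total instance-hours for a stretch of daily use."""
    return people * hours_per_day * class_days


week1_rate = implied_hourly_rate(WEEK1_SPEND_USD, PEOPLE, WEEK1_HOURS_PER_PERSON)

# ASSUMPTION: five class days in the second week (not stated in the post).
week2_hours = projected_instance_hours(PEOPLE, hours_per_day=8, class_days=5)

print(f"week 1: at most ${week1_rate:.2f} per instance-hour")
print(f"week 2: about {week2_hours} instance-hours")
```

Under those assumptions the first week works out to roughly 20 cents per instance-hour, and the second week to about 1200 instance-hours on larger (pricier) instances, which is at least in the same ballpark as the few-hundred-dollar estimate.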
So, truly awesome. I'm going to be using it for all my courses from now on, I think.
There have been only two minor hitches.
First, I'm using Consolidated Billing to pay for all of the students' computer use during the class, and Amazon has some rules in place to prevent abuse of this. They're limiting me to 20 consolidated billing accounts per AWS account, which means that I've needed to get a second AWS account in order to add all 30 students, TAs, and visiting instructors. I wouldn't even mention it as a serious issue but for the fact that they don't document it anywhere, so I ran into this on the first day of class and then had to wait for them to get back to me to explain what was going on and how to work around it. Grr.
Second, we had some trouble starting up enough Large instances simultaneously on the day we were doing assembly. Not sure what that was about.
Anyway, so I give a strong +1 on Amazon EC2 for large-ish style data analysis. Good stuff.
cheers, --titus
posted at: 07:52 | path: /jun-10 | 1 comments