Tue, 08 Jun 2010

Running a next-gen sequence analysis course using Amazon Web Services


So, I've been teaching a course on next-generation sequence analysis for the last week, and one of the issues I had to resolve before I proposed the course was how to handle the volume of data and the required computation.

You see, next-generation sequence analysis involves analyzing not just entire genomes (which are, after all, only 3 GB or so in size) but data sets that are 100x or 1000x as big! We want to not just map these data sets (which is CPU-intensive), but also perform memory-intensive steps like assembly. If you have a class with 20+ students in it, you need to worry about a lot of things:

  • computational power: how do you provide 24 "good" workstations?
  • memory
  • disk space
  • bandwidth
  • "take home" ability

One strategy would be to simply provide some Linux or Mac workstations, with cut-down data sets. But then you wouldn't be teaching reality -- you'd be teaching a cut-down version of reality. This would make the course particularly irrelevant given that one of the extra-fun things about next-gen sequence analysis is how hard it is to deal with the volume of data. You also have to worry that the course would be made even more irrelevant because the students would leave the course and be unable to use the information without finding infrastructure and installing a bunch of software and then administering the machine.

While I enjoy setting up computers and installing software and managing users, I'm clearly masochistic. It's also entirely beside the point for bioinformaticians and biologists - they just want to analyze data!

The solution I came up with was to use Amazon Web Services and rent some EC2 machines. There's a large variety of hardware configurations available (see instance types) and they're not that expensive per hour (see pricing).

This has worked out really, really well.

It's hard to enumerate the benefits, because there have been so many of them ;). A few of the obvious ones --

We've been able to write tutorials (temporary home here: http://ged.msu.edu/angus/) that make use of specific images and should be as future-proof as they can be. We've given students cut and paste command lines that Just Work, and that they can tweak and modify as they want. If it borks, they can always just throw it away and start from a clean install.

It's dirt cheap. We spent less than $50 the first week, for ~30 people using an average of 8 hours of CPU time. The second week will increase to an average of 8 hours of CPU time a day, and for larger instances -- so probably about $300 total, or maybe even $500 -- but that's ridiculously cheap, frankly, when you consider that there are no hardware issues or OS re-install problems to deal with!
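For anyone who wants to run the same back-of-the-envelope numbers, here is a minimal sketch of the arithmetic. The hourly rates are made-up placeholders, not actual AWS prices (check the pricing page for real ones), and the sketch assumes the 8 hours are per person and that the second week has 5 class days; `estimate_cost` is just an illustrative helper name.

```python
def estimate_cost(people, hours_per_person, hourly_rate):
    """Return the estimated rental cost in dollars for a whole class."""
    return people * hours_per_person * hourly_rate


# Placeholder hourly rates (assumptions for illustration, NOT real AWS prices).
SMALL_RATE = 0.10   # dollars per instance-hour, smaller instance
LARGE_RATE = 0.40   # dollars per instance-hour, larger instance

# Week 1: ~30 people, about 8 hours each in total (assumed to be per person).
week1 = estimate_cost(30, 8, SMALL_RATE)

# Week 2: ~30 people, about 8 hours a day for an assumed 5 days, larger instances.
week2 = estimate_cost(30, 8 * 5, LARGE_RATE)

print("week 1 estimate: $%.2f" % week1)
print("week 2 estimate: $%.2f" % week2)
```

Plug in the current rates for whatever instance types you pick and the same three-factor product gives you a budget to hand to your department.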

Students can choose whatever machine specs they need in order to do their analysis. More memory? Easy. Faster CPU needed? No problem.

All of the data analysis takes place off-site. As long as we can provide the data sets somewhere else (I've been using S3, of course) the students don't need to transfer multi-gigabyte files around.

The students can go home, rent EC2 machines, and do their own analyses -- without their labs buying any required infrastructure.

Home institution computer admins can use the EC2 tutorials as documentation to figure out what needs to be installed (and potentially, maintained) in order for their researchers to do next-gen sequence analysis.

The documentation should even serve as a general set of tutorials, once I go through and remove the dependence on private data sets! There won't be any need for students to do difficult or tricky configurations on their home machines in order to make use of the tutorial info.

So, truly awesome. I'm going to be using it for all my courses from now on, I think.

There have been only two minor hitches.

First, I'm using Consolidated Billing to pay for all of the students' computer use during the class, and Amazon has some rules in place to prevent abuse of this. They're limiting me to 20 consolidated billing accounts per AWS account, which means that I've needed to get a second AWS account in order to add all 30 students, TAs, and visiting instructors. I wouldn't even mention it as a serious issue but for the fact that they don't document it anywhere, so I ran into this on the first day of class and then had to wait for them to get back to me to explain what was going on and how to work around it. Grr.

Second, we had some trouble starting up enough Large instances simultaneously on the day we were doing assembly. Not sure what that was about.

Anyway, I give a strong +1 to Amazon EC2 for large-ish data analysis. Good stuff.

cheers, --titus

posted at: 07:52 | path: /jun-10 | 1 comments



Fri, 21 May 2010

Help! Help! Class notes site?


So, I'm running this summer course and I am trying to figure out how to organize the notes for students. I'd like to mix curriculum-specific notes ("here's what we're doing today, and here are some problems to work on") with tutorials (material independent of a single course, like "here's how to transfer files between computers" or "here's how to parse CSV files"), and allow students to search the documents, annotate them in their Web browser, search the annotations, and perhaps even do public or private bookmarking and tagging. The ability to edit the primary content in something other than a Web GUI would be really, really nice, too -- that way I can write in something like ReST and then upload into the system.

(This is a system I could write myself, but that's kind of silly, dontcha think?)

It should also be lightweight, reasonably mature, easy to set up, and (preferably) written in Python, although I'm willing to compromise on the last simply because I'm desperate.

Pointers, comments, suggestions welcome!

--titus

posted at: 08:22 | path: /may-10 | 7 comments



Sat, 02 Jan 2010

Managing student expectations for open-source projects


On the heels of my aggressive competence post, about (among other things) my failure to outline my expectations for students, I've started putting together a page to help manage student expectations for the pony-build project, which is participating in the Undergraduate Capstone Open-Source Projects (UCOSP) course this term.

(Please comment over at the Wordpress blog for UCOSP:

http://ucosp.wordpress.com/2010/01/02/managing-student-expectations/

so the students can see your words of wisdom!)

--titus

posted at: 13:06 | path: /jan-10 | 1 comments



Tue, 08 Sep 2009

Buggy Python code?


I'm looking for examples of frustratingly simple-yet-wrong Python code, suitable for an undergrad class to debug. I'd prefer things that don't rely on tricky features of Python (like shared list references), but rather code where subtly bad logic or program flow leads to bad behavior.
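To give a flavor of what I mean, here is one illustrative candidate (my own sketch, not one of the submitted examples; the function names are hypothetical). It uses nothing tricky, just a bad starting assumption, and a corrected version follows it for comparison.

```python
def largest_buggy(values):
    """Intended to return the largest value; wrong for all-negative input."""
    best = 0  # bug: silently assumes at least one value is >= 0
    for v in values:
        if v > best:
            best = v
    return best


def largest_fixed(values):
    """Return the largest value, seeding the search with the first element."""
    best = values[0]  # start from real data instead of a magic 0
    for v in values[1:]:
        if v > best:
            best = v
    return best


# The bug hides on "friendly" input and only shows up on all-negative input.
print(largest_buggy([3, 1, 2]), largest_fixed([3, 1, 2]))
print(largest_buggy([-3, -1, -2]), largest_fixed([-3, -1, -2]))
```

The buggy version passes any test that includes a non-negative number, which is exactly why it makes a decent debugging exercise.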

Comment below, or e-mail me; I'll post the ones I pick later. thanks!

--titus

posted at: 19:13 | path: /sep-09 | 27 comments



Thu, 02 Apr 2009

Withrow Award for Teaching Excellence


Just a short note with characteristic inhumility (ahumility? abhumility?) -- for my Concepts in Database-Backed Web Programming course, I received the Withrow Award for Teaching Excellence from the students.

This means a lot to me, because I spent a huge amount of time on that course (and will have to do so again next fall!). I trace the students' relative happiness with my course to a few particular issues:

  • I gave almost everyone an A or a B. This will change next year ;)

  • I was as close to "five nines" available as I could be: e-mail, office hours, etc. Next year, twitter?

  • I did my best to make the lectures entertaining and informative. (Anyone who watched me publicly insult Django's test framework at PyCon just for the hell of it knows what I mean by "entertaining".)

    I'm particularly proud of my repeated references to "evil Chinese hackers" -- next year, it will be "evil Canadian hackers", however. Sorry, Greg.

Continuing the inhumility, I will also mention that the Dean of Engineering said that he'd never had a student come to compliment him on a professor's teaching before -- normally they just want to bitch.

No matter how nice it is to have the students like my teaching, though, I definitely have a lot of work to do on the class; I rather failed to teach proper programming practice, looking at some recent student work. Sigh. Fall, here I come!

On a separate note, Ryan Wagoner posted about the MSU CS program. I am trying to address at least two of the four problems at the end...

--titus

p.s. Re Django, that was all Jesse Noller's fault. He made me do it -- remote control via Twitter.

p.p.s. Django's test framework is, in fact, mildly fscked. I haven't yet figured out if it's for a good reason or not -- that's another post ;)

posted at: 07:20 | path: /apr-09 | 8 comments
