I've started to think more broadly about bioinformatics training, and
after some conversations with Vicky Schneider at TGAC, Terri
Attwood at GOBLET, and others, I thought I'd write down some thoughts
on bioinformatics classrooms. In particular, what kind of compute
infrastructure is needed?
Before I get started, my assumptions and interests are as follows:
- Many students will be fairly low-tech. (Practically speaking, it's
a fair assumption that even some of those who claim to be awesome
superhackers will be missing some essential skill, so setting the
base technical expectations fairly low is the safe way to go.)
- The audience will generally be mostly bio-, not comp-. (That's
just who I teach.)
- Technology is the devil. Anything that requires the teacher to
jump through hoops while teaching, remember to press buttons, or
otherwise debug and adjust while standing in front of 15 or more
people while trying desperately to evince high IQ, is FAIL.
(If you haven't taught in a high-tech environment, you don't have the
right to tell anyone how easy it is to just remember to fribable
the bibpop every 5 minutes.)
So, what are the options?
Option 1: Participants bring their own computers
One approach for training is that participants simply bring their own
computers to the classroom. Students install the necessary software
and follow along with the workshop materials. (This is how Software
Carpentry workshops are designed, so I've taught ~5-6 workshops this
way.)
The upsides are that students can immediately use whatever they learn
in the course, and there's little or no investment needed to set up
or maintain the computers in this workshop environment.
The downsides, however, are many:
Necessary software needs to be installed on many different laptops.
This can be difficult, depending on what needs to be installed and
what OS the laptop is running.
(Many of my worst bootcamp teaching experiences come from this.)
Students must usually have reasonably up-to-date laptops, OR be able to
run virtual machines.
(Software Carpentry has had reasonably good luck with VirtualBox, but
it's still a foreign environment for most; it seems like a poor
compromise.)
Depending on the tasks being demonstrated in the workshop, student
computers may need to have more memory, disk space, or compute than
most biologists will have.
(For the NGS course we run de
novo assembly software, which may require 8-16 GB of RAM. Most
laptops don't have that much RAM.)
Some of the instructors or TAs need to be able to work on multiple platforms,
so that they can fix install problems.
(A typical problem in Software Carpentry is that many of the instructors
have Macs or Linux machines, and aren't necessarily very good at
debugging Windows installs.)
Most cutting edge bioinformatics software runs primarily or best on Linux,
and may not run on Mac OS or (especially) Windows.
(This leads to two concerns: first, you might have to choose demo
software that is particularly portable, rather than the software you
would prefer to teach; second, you're handicapping students in the long
term by not teaching them that, for better or for worse, Linux is
actually where most bioinformatics is done.)
Workshop materials must be generic enough to work on multiple platforms,
with different path layouts, OS versions, etc.
(Writing workshop materials is an immense amount of work. Having to
make them flexible in this way, and testing them and keeping them
working on multiple platforms, is even more work.)
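As a small illustration of the path-layout problem, here is a minimal sketch of how materials might build file paths without hard-coding one platform's layout. It uses Python's standard pathlib; the function name and the "workshop-data" directory are made up for this example, not something from an actual workshop.

```python
from pathlib import Path


def workshop_data_path(filename, base=None):
    """Return the path to a data file under a per-user workshop directory.

    `base` defaults to the user's home directory, whatever the OS calls it.
    The directory name "workshop-data" is a hypothetical example.
    """
    root = Path(base) if base is not None else Path.home()
    # The / operator joins path components with the right separator per OS.
    return root / "workshop-data" / filename


if __name__ == "__main__":
    print(workshop_data_path("reads.fastq"))
```

This only removes one source of platform differences; the materials still need testing on each OS.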
Option 2: Pre-installed machines
Another approach for training is to use pre-installed machines (and/or
virtual machines running on existing hardware). There's a classroom
with a bunch of hardware; the instructor
specifies software to be installed; someone (IT support, or the
instructor) installs it; students come to the classroom and use
machines that have been pre-loaded with the software needed for the
course.
On the surface, this seems like a great idea. The machine can be configured
as needed, the software is guaranteed to work, the environments are uniform,
and life is good. I've run several workshops this way and it can work
OK.
But, again, there are actually quite a few problems.
You'll need some form of dedicated IT support to configure, maintain,
and support the machines.
(This can range from merely an additional task on top of a standard
computing environment, as with MSU's Computer Science classrooms,
where we have our regular sysadmins maintain a standard set of
computers across many different classrooms; to a periodically scheduled
support request whenever a workshop is run. No matter what, this can
be a significant extra burden.)
The machines and environments are unfamiliar to the students.
(Whatever you run will be unfamiliar to 20% or more of the students,
and even on the fairly standard Mac OS X environment, there are
so many ways to customize it that native users will inevitably struggle
with the differences.)
The students generally won't be able to take what they've learned back
"home" with them.
(Even when students are at the same institution as the workshop, they
often won't have access to the training suite or environment afterwards.
This is especially true if the environment was custom-installed for the
instructor.)
Most places won't want to provide substantial compute capacity for the
instructional lab. This either limits what can be taught, OR means
that students will also have to gain access to a local compute cluster,
which complicates things further.
(Most compute clusters are not thrilled at the idea of setting up specialized
access for workshop attendees, either, in my experience.)
The instructor needs to come up with specialized install
instructions for the software they need installed on whatever
environment the workshop will use.
(This can be a significant additional burden on the
instructor. Moreover, since the IT support will generally not be
expert in whatever the instructor needs installed, the instructor
will also need to verify the installs.)
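One way to lighten the "verify the installs" step is a small pre-flight script run on each classroom machine. The sketch below only checks that each command can be found on the PATH, not that it actually works; the tool names listed are placeholders I picked for illustration.

```python
import shutil


def missing_tools(required):
    """Return the names in `required` that are not found on PATH."""
    return [name for name in required if shutil.which(name) is None]


if __name__ == "__main__":
    # Hypothetical tool list; replace with whatever the course needs.
    required = ["python3", "git", "samtools"]
    missing = missing_tools(required)
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All required tools found.")
```

A fuller check would also run each tool on a tiny test input, since a command being present says nothing about its version or whether it runs correctly.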
Option 3: Cloud computing
A third option is to use remote-hosted virtual machines (aka cloud
computing). The idea here is that the instructor specifies some
cloud service (Amazon, Rackspace, iPlant) to which all students
can have access; s/he provides a customized virtual machine with
some or all of the necessary software installed; and students use
the virtual machines remotely via either their laptop or provided
workstations.
It will come as no surprise to readers of my blog that this is my
favorite option. It has much to recommend it: participants can use
their own computers, their own Web browser, and whatever SSH program
they like (Windows is the only OS that doesn't come with SSH
natively). Graphic interaction can be supported either via X Windows
(ugh) or IPython Notebook or knitr. Students can bring home their
expertise, assuming the cloud platform is still available to them at
home; alternatively, if their home institution provides hosted VMs,
they can use that. Compute can be scaled up, or down, as needed for
whatever is being taught -- Amazon now rents machines with over 200 GB
of RAM, for example.
I've now taught over a dozen workshops this way, with a high degree
of success (at least on the technical side).
Problems, nonetheless, abound:
Some institutions, labs, and funding agencies don't want to use remote
computers for legal or other reasons (think HIPAA).
(I haven't run into this myself; but, following my life goal of
minimizing face time with lawyers, working out the legalities can be
problematic. Do note that the NSA uses Amazon Web Services for some
things, so it's a little hard to believe that something couldn't be
worked out for medical or other sensitive data.)
Sharing files between local and remote is a perennial problem.
(I usually use Dropbox, which provides a command-line installable
client.)
Few people are prepared to edit files remotely.
(Well, frankly, few people are prepared to edit text files at all.
I use either IPython Notebook or pico remotely, OR encourage people
to edit things in Dropbox or on github.)
You need reliable network access and decent servers for the material
and the data you're using.
(In practice this is a must for most classrooms these days, but it
can still be a problem when 30 people are clicking on the same link
to download the same 50 MB file.)
Cloud computing frequently costs money.
(Amazon and Rackspace both charge money; iPlant could be free, but is
still only dipping its toes into this area. Grants can usually be
obtained to help with the costs during the workshop, but what do students
do when they get home? Our local HPC has taken some of our cloud
instructions and used that to install the software locally, which is a
pretty neat idea and something we're pursuing more generally.)
The tech competencies needed to set up and work with cloud machines
are a bit more specialized than those needed for local sysadminning.
(Some reasonably significant expertise, or at least time investment,
is needed to get familiar with setting everything up. It's, uhh,
fairly easy when you've spent several years doing it ... ;))
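To put a rough number on the network point above (30 people downloading the same 50 MB file), here is a back-of-the-envelope sketch. The 30 students and 50 MB come from the example in the text; the 100 Mbit/s shared link is purely an assumption, and the calculation ignores protocol overhead and server limits.

```python
def download_seconds(n_students, file_mb, link_mbit_per_s):
    """Seconds for everyone to finish if the shared link is the only bottleneck.

    Uses decimal units (1 MB = 8 megabits) and ignores protocol overhead.
    """
    total_megabits = n_students * file_mb * 8
    return total_megabits / link_mbit_per_s


if __name__ == "__main__":
    # 30 students and a 50 MB file are from the text; 100 Mbit/s is assumed.
    print(download_seconds(30, 50, 100))  # 120.0
```

Two minutes is tolerable; on a 10 Mbit/s link the same download takes twenty, which is long enough to derail a lesson.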
Conclusions
There are no panaceas.
I like the cloud. It's served me well.
What did I miss?
--titus