Sun, 07 Aug 2011

On mentoring


One of the most important jobs a professor has is to pay it forward: that is, to teach, train, mentor, support, and open up opportunities for their students and postdocs. It's a job that is undervalued by those who focus on the short term -- the administrators and review committees that judge us by the money we bring in and the papers we publish. It's a misunderstood job, as well; the goal of a good mentor is not to mold their students in their own image, but to push them; to expand their horizons, not to contract them around the mentor's own views. And it's probably one of the two or three most rewarding parts of the job (surpassed only by the fun of actually doing the research!)

I came to academia to do science: to research biology, and to solve problems that are as yet unsolved. And my goal is to become a good researcher, and to solve really hard problems well; hopefully that will be part of whatever legacy I have. But it's increasingly clear to me that the best, most lasting legacy possible is to train the next generations of scientists in doing good science, be it in industry or academia. For example, my father trained nearly 100 PhD students during his research career, and always felt that his training record was one of his most impactful contributions -- many of those 100 students are professors today, doing their own training. Pay it forward, indeed!

Recently, I realized that if I drop dead tomorrow (or don't get tenure in three years -- same thing, right?) I would still have touched a number of people's lives in really positive ways. For example, I received an e-mail from one student who participated in GSoC, telling me that she felt she had learned an immense amount about self-sufficiency and the value of making an effort from that experience. I hired another student to do lab grunt work; she ended up liking science, switched majors, graduated, has now received several national fellowships, and is going to graduate school. I don't take credit for much past hiring her, but if I hadn't hired her, she would not have had the opportunity to show how good she is. Blogging and writing tutorials count, too: at PyCon, 2-3 people a year come up to me and tell me how much they appreciate one or another of my blog posts or tutorials. Even class teaching can be impactful -- at graduation recently, an entire group of CS students told me how much they'd enjoyed my class on Web dev, with one particular student introducing me to her parents as "one of the good ones".

It's hard to overstate how fantastic it is to watch students grow and change.

Despite the joys of transforming lives, many professors only want to take energetic, intelligent, well-spoken, well-trained students. Yet these students are the ones that already have opportunities, and frankly need less mentoring than others -- they already "get it", and really just need experience. Providing that experience is valuable, and a key part of training. But the same professors that jump at the 4.0 GPA student who speaks English natively and has tons of energy will turn down opportunities to take students from non-research intensive institutions, or unfocused students who come from non-academic backgrounds, or people who haven't the faintest idea what science is. And I think they do themselves and the students a disservice. These students simply have never seen the same opportunities that many students at (e.g.) MSU take for granted; the difference you could make in their lives dwarfs the impact you'd make on most better prepared students.

It's worth remembering that at some point we all start as wet-behind-the-ears youngsters that have never confronted a problem without an answer. Someone took a chance on us -- maybe it was easier to get that chance for those of us from a top-ranked institution, with an academic background (...me), or maybe it was a random "hey, do you want to work for me over the summer?" chance (...lots of people), or maybe it was an organized program to introduce underrepresented minorities to research (...lots of people). Now that you're a grad student, or a postdoc, or a professor, or an open source hacker with commit privileges to a dozen projects, take that chance. It's a lot of work but it can be more rewarding than pretty much anything else.

If I have a point with this blog post, it's this: one of the privileges of academia, and mentoring programs like GSoC, is being put in a position where you can touch many people's lives, as a mentor and a teacher. Do so! Take chances on people! Lay out some expectations and see who rises to meet them, and then chase down those that don't understand what to do. Don't start out with the expectation that you must be, or will be, rewarded in kind -- many students take years to be productive, may end up working on something completely different with someone else, and may never even say thanks. But that's OK, if you don't treat teaching and undergraduate research as a way to get more work done ('cause honestly, it's not), but rather as an opportunity to introduce the joys of your work to people who may have no idea that such awesome jobs exist. Pay it forward.

--titus

posted at: 11:33 | path: /aug-11 | 3 comments



Fri, 08 Jul 2011

Course: Analyzing Next-Generation Sequencing Data (2011 version)


The second iteration of our bioinformatics summer course, Analyzing Next-Generation Sequencing Data, just finished. It was a great success, at least judging from the comments that people made to us personally; the evaluations aren't yet complete.

The what: a two week course on analyzing next-gen sequencing data, using the Amazon cloud, for biologists with no prior computational experience.

The who: 2 instructors (myself and Ian Dworkin), 4 TAs (Jason, Likit, Rose, and Chris), and 2 visiting faculty (Istvan Albert and Erich Schwarz). 24 students, from 5 of the 7 continents (Australia, South America, Europe, Asia, and North America).

The where: two weeks out at the lush Kellogg Biological Station, a research station in west mid-Michigan (just north of Kalamazoo).

As with last year's course, we taught students to use Amazon AWS for all their computing needs, including compute and data storage. We had an education grant from Amazon this year, which let us play with more and niftier services: I used less (well, no) S3 and more/bigger EC2 machines with EBS volumes and snapshots. This worked fantastically well, with the main problem being the lousy connection between our room and the rest of the campus, which led to occasional bottlenecks as 24 students tried to bring up their machines. I don't know how many people are taking even partial advantage of AWS for their classes, but the sheer awesomeness of the cloud for teaching computer science and scientific computation is something that needs to be broadcast.

The students this year were more motivated, if possible, than last year's. If I had more than two points on the graph, I'd suggest that there was a trend here: everyone is getting this data, and nobody has the least #!%!@^@#% idea what to do with it. (Oh, heck, it's an obvious trend ;). One fallout from this is that we challenged the students with more scripting, and were rewarded -- most people seemed to think it was a good intro, although we'll have to see what happens down the road with that. (Also see this great Nat. Methods technology feature, replete with quotes from yours truly, about programming in biology.)

The students this year were also more diverse. Last year we had 33 applicants for 24 slots, and I accepted 24 and 23 came. This year we had 133 applicants (!!) for 24 slots, and all 24 were taken (although I had to fill a slot left by one student who simply didn't show up. Bad person. No cookie for you.). This led to more diversity in backgrounds, skill sets, and training. We even let a bioinformatician slip into the course by accident ;')

Very interestingly, the range of critters under study was more diverse. This was partly due to selection criteria -- Ian and I did most of the selecting, and as we are both evolution-curious, we tried to pick a diverse group of critters. Notables included: two cnidarian researchers, one dinoflagellate researcher (their critter had a freakin' 64 Gb genome... that's 21 times human!), a last-minute lophotrochozoan person, someone working with 20-odd microbial genomes, a mouse/human xenografter, someone studying a symbiont of a symbiont, and someone working on strange, edible sea critters, as well as many others that I'm forgetting about. I think another trend that is probably completely obvious to those of us working in evolution studies is that literally everything on this planet is getting blended up and thrown into sequencers -- and there ain't no reference genomes for these critters...

As with last year, we tried to pass on the message that this was a very immature and fast moving field, and you had to worry a lot about the quality of your sequence, on top of all the other issues. Interestingly, this was a strong message from Istvan Albert's talk, too; I know that Ian and I both tend to be very skeptical of next-gen sequence quality, but people are sometimes skeptical of the strength of our skepticism, so it was nice to hear the same message from someone completely different!

Fewer of my hair follicles were lost this year, thanks to Ian's participation as co-instructor as well as the awesome guest lecturers. That was nice. Sleep was still hard to find, but I had more fun and fewer moments of sheer terror.

All of our tutorials are copy-paste and freely available under a Creative Commons/Attribution license. Please use and abuse! There are tutorials on basic mapping and assembly, transcriptome analysis, ChIP-seq, and resequencing analysis available, among others.

http://ivory.idyll.org/permanent/ngs-2011-bbq.png

One piece of good news for students who want to take this course: it looks like the 2012 course will be happening! If it does, it will run around June 4 - June 15. Same bat place. I will announce it here and everywhere & link to it from the old course page.

A few other courses have sprung up. The three that I think look the best are: the UC Davis course, in September; the NESCent course; and the Ft. Collins workshop on Comparative Genomics, happening in July (next week!). CSHL is running a few different courses (programming for biologists, advanced sequencing technologies, and computational/comparative genomics). The Davis course is the only one that uses the cloud much, which is a shame; the cloud usage is something that a lot of people seem to like about our course, and it's really convenient for people from institutions without significant compute resources. (Note that I know the NESCent course, at least, was just as oversubscribed as ours. There's room out there for more courses!)

BTW, we'd love to find someone interested in writing up our course (and maybe these others?) for a news piece in some journal. Drop me a note at ctb@msu.edu if you are interested!

--titus

posted at: 08:58 | path: /jun-11 | 0 comments



Thu, 09 Dec 2010

(Some) Principles of Computational Science


I'm just finishing up my Computational Science for Evolutionary Biologists course, and I'm finding it tricky to come up with a good high-level summary of what I would like them to take away. As you can see from the class notes they've done some reasonably neat stuff with Digital Life and (separately!) next-gen sequence analysis, but the class has been somewhat random in its topics and train of thought.

Anyway, for the final class I decided I'd go slide by slide through a number of principles that they should apply if and when they find themselves doing computational science. In each case I can point to class exercises and homeworks that illustrate the points, which I think means I haven't totally failed... ;)

Anyway, here's what I have so far:


13 Principles of Computational Science:

  1. Computational science is just like any other science: don't trust it if you don't understand it.

Seriously. Computers aren't magic, and computational jargon isn't any more meaningful than any other jargon.

  2. The entire chain of evidence matters.

Keep close track of the raw data; the analysis source code; and the parameters used at each stage of data generation, processing and summarization.

Corollary: Make your raw data available. To do otherwise is just silly.

  3. If it's not automated, it's crrrrrap

As soon as there's some manual step in your pipeline, you've lost track of what you're doing. You may do it differently, or not at all, or incorrectly. And you'll never know. You'll just get different results. Sometimes.

  4. Use version control.

If it's neither raw data (backed up!) nor generated data, put it in version control.

  5. Using other people's software to do science is hard.

They probably had some other use in mind that doesn't fit your needs, but you're going to try to adapt it anyway, aren't you? Good luck with that.

Corollary: using your own software to do science, 2 years after you wrote it, is hard -- because you're not you any more. (Remember, you can never step in the same stream twice.)

  6. No software is trustworthy.

Until you understand your software stack intuitively, have obsessed over parameter choices, and have locked down your software behavior with automated tests, don't trust it. After that, you can grudgingly extend some minimal trust to it, at least until the next version is released.

  7. Computation is not science.

Science is science. Computation may be one of the ways in which you do science.

  8. Hypotheses are good.

It's virtually impossible to analyze data without some kind of hypothesis in mind.

Corollary: Each hypothesis is only a starting point. It's probably wrong, so don't get too attached to it.

  9. More data is not necessarily less confusing.

The more data you have, the harder it can be to get a clean signal. Statistics help here, unless of course you have an unknown systematic bias in your data.

Corollary: You have an unknown systematic bias in your data.

  10. Interdisciplinary research is hard.

You need to be an expert in multiple fields, each with its own special techniques, lingo, and "commonly understood" shibboleths. Proper hypothesis testing involves mastering the first two; publication may depend on avoiding the latter.

Corollary: computational science is implicitly interdisciplinary, hence hard. (If it were easy, we wouldn't need smart people like you to do it, right?)

  11. A lot of computing is just details.

There's very little magical about computing. An awful lot of it is just more details to remember. Running software, gathering the results, processing them, plotting them, tweaking parameters, etc.

  12. Look at your data.

Look at your data, and your results, in as many ways as possible. You'll often be surprised by what's actually in there.

  13. Above all, tell a story.

Nobody is interested in just graphs. If you don't have an interesting story, dig deeper.
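
To make the "chain of evidence" principle a bit more concrete, here is a minimal Python sketch of what recording a run might look like. It is an illustration only, not code from the class: the function names (`sha256_of`, `record_run`), the JSON layout, and the idea of passing in a version-control revision id as `code_version` are all assumptions made for the example.

```python
import hashlib
import json
import sys
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, reading it in 1 MB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fp:
        for chunk in iter(lambda: fp.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_run(raw_data, params, code_version, out_path):
    """Write a JSON record tying a result to its raw data, parameters, and code.

    All field names here are hypothetical; adapt them to your own pipeline.
    """
    record = {
        "raw_data": str(raw_data),
        "raw_data_sha256": sha256_of(raw_data),  # detects silent changes to the input
        "params": params,                        # parameters used at this stage
        "code_version": code_version,            # e.g. a version-control revision id
        "python": sys.version.split()[0],        # interpreter version, for the record
    }
    Path(out_path).write_text(json.dumps(record, indent=2, sort_keys=True))
    return record
```

Calling something like `record_run("reads.fastq", {"k": 31}, "abc123", "run.json")` from inside an automated pipeline (rather than by hand) would leave a small file next to each result that says which raw data, parameters, and code produced it.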

I know, somewhat scattered. Any more thoughts, or pointers to similar lists?

thanks,

--titus

p.s. I plan to finish up with my (IMO very underappreciated) principles of How to be a Successful Computational Scientist, summarized here:

  1. Never show them your data.
  2. Do not, under any circumstances, communicate clearly.
  3. Never release your source code, either.
  4. Judge computational science by results, not quality.
  5. Use as much data as possible.

Then they get to fill out evaluations. Whee!

posted at: 21:45 | path: /dec-10 | 2 comments



Tue, 08 Jun 2010

Running a next-gen sequence analysis course using Amazon Web Services


So, I've been teaching a course on next-generation sequence analysis for the last week, and one of the issues I had to deal with before I proposed the course was how to deal with the volume of data and the required computation.

You see, next-generation sequence analysis involves analyzing not just entire genomes (which are, after all, only 3 Gb or so in size) but data sets that are 100x or 1000x as big! We want to not just map these data sets (which is CPU-intensive), but also perform memory-intensive steps like assembly. If you have a class with 20+ students in it, you need to worry about a lot of things:

  • computational power: how do you provide 24 "good" workstations?
  • memory
  • disk space
  • bandwidth
  • "take home" ability

One strategy would be to simply provide some Linux or Mac workstations, with cut-down data sets. But then you wouldn't be teaching reality -- you'd be teaching a cut-down version of reality. This would make the course particularly irrelevant given that one of the extra-fun things about next-gen sequence analysis is how hard it is to deal with the volume of data. You also have to worry that the course would be made even more irrelevant because the students would leave the course and be unable to use the information without finding infrastructure and installing a bunch of software and then administering the machine.

While I enjoy setting up computers and installing software and managing users, I'm clearly masochistic. It's also entirely beside the point for bioinformaticians and biologists - they just want to analyze data!

The solution I came up with was to use Amazon Web Services and rent some EC2 machines. There's a large variety of hardware configurations available (see instance types) and they're not that expensive per hour (see pricing).

This has worked out really, really well.

It's hard to enumerate the benefits, because there have been so many of them ;). A few of the obvious ones --

We've been able to write tutorials (temporary home here: http://ged.msu.edu/angus/) that make use of specific images and should be as future-proof as they can be. We've given students cut and paste command lines that Just Work, and that they can tweak and modify as they want. If it borks, they can always just throw it away and start from a clean install.

It's dirt cheap. We spent less than $50 the first week, for ~30 people using an average of 8 hours of CPU time. The second week will increase to an average of 8 hours of CPU time a day, and for larger instances -- so probably about $300 total, or maybe even $500 -- but that's ridiculously cheap, frankly, when you consider that there are no hardware issues or OS re-install problems to deal with!
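
For anyone who wants to redo that back-of-the-envelope estimate, here is a rough Python sketch. Only the first-week figures (under $50, ~30 people, ~8 hours each) and the 8 hours a day for the second week come from the numbers above; the 5-day week and the 2x price multiplier for larger instances are assumptions for illustration, not actual EC2 prices, and the function names are made up for the example.

```python
def implied_hourly_rate(total_cost, people, hours_each):
    """Cost per instance-hour implied by a total bill."""
    return total_cost / (people * hours_each)


def projected_cost(people, hours_per_day, days, hourly_rate, size_multiplier=1.0):
    """Projected bill; size_multiplier scales the rate for larger instances."""
    return people * hours_per_day * days * hourly_rate * size_multiplier


# First week: under $50 for ~30 people at ~8 hours each, so this is an upper bound.
week1_rate = implied_hourly_rate(50.0, 30, 8)

# Second week: 8 hours a day. ASSUMPTIONS: a 5-day week, and larger instances
# costing somewhere between 1x and 2x the first week's implied rate.
week2_low = projected_cost(30, 8, 5, week1_rate)
week2_high = projected_cost(30, 8, 5, week1_rate, size_multiplier=2.0)

print(f"implied rate: ${week1_rate:.2f}/instance-hour")
print(f"week 2 estimate: ${week2_low:.0f} to ${week2_high:.0f}")
```

Under those assumptions the second week comes out at roughly $250 to $500, which is in the same ballpark as the guess above.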

Students can choose whatever machine specs they need in order to do their analysis. More memory? Easy. Faster CPU needed? No problem.

All of the data analysis takes place off-site. As long as we can provide the data sets somewhere else (I've been using S3, of course) the students don't need to transfer multi-gigabyte files around.

The students can go home, rent EC2 machines, and do their own analyses -- without their labs buying any required infrastructure.

Home institution computer admins can use the EC2 tutorials as documentation to figure out what needs to be installed (and potentially, maintained) in order for their researchers to do next-gen sequence analysis.

The documentation should even serve as a general set of tutorials, once I go through and remove the dependence on private data sets! There won't be any need for students to do difficult or tricky configurations on their home machines in order to make use of the tutorial info.

So, truly awesome. I'm going to be using it for all my courses from now on, I think.

There have been only two minor hitches.

First, I'm using Consolidated Billing to pay for all of the students' computer use during the class, and Amazon has some rules in place to prevent abuse of this. They're limiting me to 20 consolidated billing accounts per AWS account, which means that I've needed to get a second AWS account in order to add all 30 students, TAs, and visiting instructors. I wouldn't even mention it as a serious issue but for the fact that they don't document it anywhere, so I ran into this on the first day of class and then had to wait for them to get back to me to explain what was going on and how to work around it. Grr.

Second, we had some trouble starting up enough Large instances simultaneously on the day we were doing assembly. Not sure what that was about.

Anyway, so I give a strong +1 on Amazon EC2 for large-ish style data analysis. Good stuff.

cheers, --titus

posted at: 07:52 | path: /jun-10 | 1 comments



Fri, 21 May 2010

Help! Help! Class notes site?


So, I'm running this summer course and I am trying to figure out how to organize the notes for students. I'd like to mix curriculum-specific notes ("here's what we're doing today, and here are some problems to work on") with tutorials (material independent of a single course, like "here's how to transfer files between computers" or "here's how to parse CSV files"), and allow students to search the documents, annotate them in their Web browser, search the annotations, and perhaps even do public or private bookmarking and tagging. The ability to edit the primary content in something other than a Web GUI would be really, really nice, too -- that way I can write in something like ReST and then upload into the system.

(This is a system I could write myself, but that's kind of silly, dontcha think?)

It should also be lightweight, reasonably mature, easy to set up, and (preferably) written in Python, although I'm willing to compromise on the last simply because I'm desperate.

Pointers, comments, suggestions welcome!

--titus

posted at: 08:22 | path: /may-10 | 7 comments
