Mon, 27 Apr 2009

Software testing in science


As part of a CiSE submission I'm working on, I interviewed the lead developer of a scientific software package today. This software package is mainly used for evolutionary studies, and has a small but devoted following - ~6 developers and ~12 users locally, plus a few dozen users outside of MSU. I asked him a bunch of questions about development infrastructure and testing, and while the answers weren't surprising to me -- I've been friends with people on this team for over 15 years now and know the state of the software reasonably well -- they did offer some food for thought.

The main testing method used for this software package is consistency tests, or what I would call regression tests: they compare assumed-good output from some old benchmark version to output from the current version. If they match, they declare victory: the current version reproduces the old results, so nothing is broken. The lead developer did say that he knew that the coverage of the consistency tests was poor, because when bugs in various parts of the code cropped up & were fixed, the consistency tests didn't often fail.

Other than that, there's essentially no unit or functional testing.
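
To make the idea concrete, here is a minimal sketch of a consistency test in Python. It is not the package's actual test code: run_model is a hypothetical stand-in for the real program (a seeded random walk), and consistency_diff is a name I made up for the comparison step. The only assumption is that the program's output is deterministic for a given seed.

```python
import difflib
import random


def run_model(seed, generations=5):
    """Hypothetical stand-in for the real program: a seeded random walk."""
    rng = random.Random(seed)
    fitness = 0
    lines = []
    for gen in range(generations):
        fitness += rng.choice([-1, 0, 1])
        lines.append(f"generation={gen} fitness={fitness}")
    return "\n".join(lines)


def consistency_diff(reference, current):
    """Return unified-diff lines; an empty list means the outputs match."""
    return list(difflib.unified_diff(
        reference.splitlines(), current.splitlines(),
        fromfile="reference", tofile="current", lineterm=""))


# "Assumed-good" output, captured once from a benchmark version.
reference = run_model(seed=42)

# Same seed, same code: the current output should reproduce the reference.
assert consistency_diff(reference, run_model(seed=42)) == []
```

A test like this only notices changes in the code paths the benchmark run happens to exercise, which fits the lead developer's observation that fixed bugs rarely made the consistency tests fail.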

  1. How do you measure correctness in scientific software?

There's no yardstick, and often no previous results, just intuition based on what other people have discovered and predictions from mental models or mathematical models. This allows for fuzzy correspondence and isn't really a check on actual correctness, but rather on lack of obvious errors. Without going line-by-line or trying for formal proofs (neither of which academic programmers have any more patience for than any other programmer) you just have to trust that any major errors will pop up at some point.

  2. What's the incentive to be correct in scientific programming, anyway?

It boils down to "avoiding embarrassment," with a bit of "let's actually try to be useful, i.e. predictive or functional". There's no strong financial or career motivation to write correct software in academia; again, it's more an issue of avoiding obviously incorrect results. I don't mean that to sound harsh or manipulative -- it's frankly understandable, given that in research it's really difficult to figure out what the right answer should be, in the cases where you can figure it out at all.

  3. What's the incentive to minimize maintenance costs?

There isn't really one, at least not for the first few years of a project. One of the main reasons I got involved in testing was in order to control maintenance costs for one of my projects; this, however, was an already rather successful scientific project, with a few hundred users. If you don't have a lot of users, why put effort into things like documentation, tutorials, testing and automated builds? Sure, you might attract users, but nobody really cares how many users you have. They care about publications.

  4. What's the measure of success?

At first (and second, and third) blush, it's "publication" -- did I produce an interesting enough result that I can publish it? Only after a fair number of publications do most scientists think about releasing their software.

posted at: 16:24 | path: /apr-09 | 0 comments



TALK: Open Source at Microsoft: The Past, Present, and Future


I'd like to invite you to attend the last of the Michigan State University CSE colloquia for the 2008-2009 academic year. Sam Ramji, an AT&T Visiting Lecturer jointly sponsored by the MSU LCT and the CSE department, will speak about

Open Source at Microsoft: The Past, Present and Future

in CommArts room 147, Friday May 1, at 11:00am. I encourage you all to attend and to forward this on to others who might be interested! As you know, open source software is playing an increasingly big part in education, academia, science, and business, and so I expect this to be a very interesting talk.

Contact me at ctb@msu.edu for further information.

--

Abstract:

Since Microsoft established its Open Source Lab in Redmond more than five years ago, it has worked with many open source players to make Windows the best platform for all applications to run on. But this has not been without its challenges and there is a lot more work to be done on this front. This talk will cover the thinking behind Microsoft's current open source strategy and what this means for the software engineers of the future. It will also spotlight some innovative Open Source projects the company is supporting at universities across the world.

Biography:

Sam Ramji is the Senior Director of Platform Strategy leading Microsoft's platform strategy efforts across the company, including long-term strategic planning in the Windows Server and Tools organization. Sam's primary focus is to drive Microsoft's Linux and Open Source Strategy, working together with Microsoft technology development teams and open source communities to build interoperable solutions.

Prior to his current role at Microsoft, Sam was a Director of Emerging Business working on the Silicon Valley Campus where he managed relationships with Venture Capitalists and entrepreneurs. Prior to joining Microsoft, Sam led technical product strategy at BEA Systems, engineering teams building large-scale applications on Open Source software (at Ofoto.com) as well as hands-on development of client, client-server, and distributed applications on Unix, Windows, and Macintosh at prior companies.

Sam holds a Bachelor of Science degree in Cognitive Science from the University of California at San Diego, and is a member of the Institute for Generative Leadership.

posted at: 12:17 | path: /apr-09 | 3 comments



Wed, 22 Apr 2009

Open Source is like a mistress


Open source coding is like a not-so-demanding mistress: I work on it at night, surreptitiously, after my wife and daughter are asleep. twill and figleaf are like bastard children, who only get attention when I can spare it from my "real" family (my teaching, research or my actual family, depending ;)

Sigh.

--titus

posted at: 22:45 | path: /apr-09 | 1 comments



Mon, 20 Apr 2009

What is disco?


Anyone out there used disco (http://discoproject.org/)? Comments, good/bad/neutral?

From the page:

Disco is an open-source implementation of the Map-Reduce framework for distributed computing. As the original framework, Disco supports parallel computations over large data sets on unreliable cluster of computers.

The Disco core is written in Erlang, a functional language that is designed for building robust fault-tolerant distributed applications. Users of Disco typically write jobs in Python, which makes it possible to express even complex algorithms or data processing tasks often only in tens of lines of code. This means that you can quickly write scripts to process massive amounts of data.
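
For anyone who hasn't met the map-reduce pattern that description refers to, here is a minimal single-process sketch in plain Python. It deliberately does not use Disco's API (which I haven't tried); map_words, reduce_counts and map_reduce are names made up for illustration, and a real framework would spread the map and reduce calls across a cluster.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """Reduce step: combine all values emitted for one key."""
    return word, sum(counts)


def map_reduce(lines, mapper, reducer):
    """Run the map, group-by-key, and reduce phases in a single process."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)
    return dict(reducer(key, values) for key, values in grouped.items())


# Word count, the usual toy example for this pattern.
word_counts = map_reduce(["to be or not", "to be"], map_words, reduce_counts)
```

The point of the pattern is that the mapper and reducer are small, independent functions, which is what lets a framework run them in parallel and rerun them on another machine when one fails.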

thanks!

--titus

posted at: 12:43 | path: /apr-09 | 4 comments



Sun, 12 Apr 2009

A grab bag of GSoC frustrations


Students don't understand the process we go through, partly because we don't make it very transparent.

People don't view this as a chance to mentor students, but rather as a job application.

Google doesn't give us enough money ;)

posted at: 10:29 | path: /apr-09 | 0 comments
