Sat, 29 Sep 2007
The Scientific Public License?
After our long software licensing discussion on the biology-in-python list, I realized that I wanted something different in a license for scientific software.
Specifically, I would like to attach the following clause to either a BSD or L/GPL style license:
Publications relying on derivative works of this software must publish all such derivative works under this same license.
The intent is that Dr. Joe Blow, if he chooses to use and modify my software to support his data analysis or computation and then publishes his results, must republish the software with his modifications.
Has anyone seen anything like this?
I should also link to the Bayh-Dole Act, which -- for scientists -- is an extremely important law, because it grants universities copyright to source code produced under federal funding. This is why Caltech and MSU own my code, rather than the government, and it's why they can force me to release my code under the GPL.
--titus
posted at: 13:37 | path: /sep-07 | 5 comments
Software Licensing
This month the newly minted biology-in-python mailing list erupted into a discussion of licenses. There was some confusion about the goal of the discussion, for which I'm largely responsible: we didn't make it clear that we were talking about licenses for code and content posted on the bio.scipy.org community Web site, so people were worried that we were trying to dictate license choices for all Python/bioinformatics software! Not at all! Anyway, I'm happy with the decision that we've posted, which is to place tutorial/example code under the BSD license, and discussion under Creative Commons/attribution.
A number of really interesting posts came through on this subject. Bruce Southey posted several links, including Would Dostoevsky use the GPL? and Maintaining Permissive-Licensed Files in a GPL-Licensed Project. Josh Wilcox posted about a "grace period" hack; to quote:
In addition to the terms of the GNU General Public License, this
licence also comes with the added permission that, if you become
obligated to release a derived work under this licence (as per
section 2.b), you may delay the fulfillment of this obligation for
up to 12 months ("grace period"). If you are obligated to release
code under section 2.b of this licence, you are obligated to release
it under these same terms, including the 12-month grace period
clause.
This is an interesting idea but I have no idea if, in this case, companies would care at all: we're talking about tutorial and example code here, not real software.
I also found it extremely interesting to watch the dynamics between the free-as-in-beer and free-as-in-speech people. I'm currently willing to release software under either license -- I relicensed twill from GPL to BSD with the last release, for example -- but I have very little sympathy with the idea that companies should be able to take my code, close it, modify it, and resell it. Nonetheless I understand that competing ideologies exist and I'm willing to accommodate them as best I can. (Conveniently for my leanings, both of the universities I work for, Caltech and MSU, demand that work-related software be released under the GPL.) Watching people consistently misrepresent their positions -- I assume they did so knowingly -- as "the GPL is free-er than the BSD!" and "the BSD is free-er than the GPL!" was very interesting and informative. (I would put it this way: the GPL restricts software use in specific ways, with the ultimate goal of increasing the freedom to use all derivatives of that software.)
Anyway, the list conversation got a bit long, and after I received a number of complaints about how annoying the list was becoming, I ended the discussion by fiat: I declared that we should either stop discussing licenses for 6 months, or we should move ongoing discussion to a new list. Enough people expressed interest that I created a new list, bip-admin, to contain further admin discussion for bio.scipy.org. (I realized later that meta-bip would have been a better name. Alas, renaming mailman lists is not trivial.)
One of the most frustrating things about the license discussion was that we really have no content whatsoever on bio.scipy.org, and here we were discussing how to handle all of this nonexistent content rather than writing some! I am both amused and horrified at the ability of people (including myself!) to talk about procedure and protocol endlessly while failing to actually do useful work. I guess it's the human condition -- heck, Og and Boog probably argued about the proper protocol for deciding whose turn it was to go get more firewood, back when we lived in caves...
--titus
posted at: 13:37 | path: /sep-07 | 2 comments
Thu, 27 Sep 2007
My SciPy '07 Talk
In the spirit of cleaning up my desktop... here's a PDF of my talk on Cartwheel at SciPy 2007.
--titus
posted at: 18:03 | path: /sep-07 | 0 comments
Slowly spreading my tentacles throughout Michigan State
I'm now listed on the Gene Expression in Disease and Development page, as well as on the CompSci faculty page, MicroMolecularGenetics faculty page, QuantBio page, and SysBio page.
It was quite a shock to log into the CompSci cluster at MSU and see my group set as "faculty". As a sysadmin, I've always thought of faculty as people that don't really use UNIX much; am I become them? shudder ;)
I've also formally put forward two classes, for my first year of teaching (starting fall '08). The first one is probably of more general interest:
Introduction to Database-Backed Web Development (CSE 291)
Spring 2009
80 min lecture / 2 hr lab
Prerequisites: CSE 232. Knowledge of Python is suggested but not required.
The goal of this course is to introduce students to the theory and practice of database-backed Web site development. By the end of the course, students will have implemented a simple but complete interactive "Web 2.0" multiuser Web site in Python with asynchronous JavaScript (AJAX) features and an SQL database. Students will learn basic HTML, CSS, SQL, and JavaScript, while gaining an understanding of client-server programming, software architecture considerations, automated tests, and basic software carpentry (version control, source code management, and tools for collaboration).
Graded work: weekly programming assignment and two short papers.
Justification: database-backed Web development exposes students to a plethora of modern (and immediately relevant) technologies. Actually implementing a simple Web site will introduce students to modern network programming, client-server architecture, and software architecture design and deployment considerations -- all practical skills with deep underpinnings in computer science. I'm hoping to foster an increased awareness of effective programming tactics and skills with this course, as well as exposing students to a variety of technologies and theoretical considerations.
Additional topics, if they can be worked in: usability considerations; statelessness; REST; scalability; OS/network stack; OS process/thread/event handling; the Semantic Web; remote APIs and RPC; trust networks; and social engineering.
I plan to use Quixote and ExtJS for this course: the former because it is simple to grok, and the latter because I like what I've seen of it.
The second course is more research-y:
Open Problems in Bioinformatics (CSE 491)
Fall 2008
One 80 minute lecture, one 80 minute discussion.
Prerequisites: graduate standing in science or engineering. (There's no way to make an effective prereq list.)
This course will introduce biologists to computational considerations, and computational scientists to biological considerations, in the context of modern biological "grand challenges". Likely topics include genome-scale annotation, comparative and regulatory genomics, metagenomics, large-scale analysis of experimental data, phylogeny, gene and protein interaction networks, and machine learning techniques. The intention is to cross-fertilize interests and expertise, as well as expose students to considerations in large-scale data analysis and scientific inference. The course will be graded on attendance and participation, as well as a short presentation as part of a group.
Additional potential topics: genome-scale alignments; RNAi/ncRNA; gene finding; assembly issues; whole-genome phylogenetics; protein structure; databases, data integration, and data warehousing.
Neither of these courses overlaps much with anything else offered at MSU (or anywhere else, AFAIK).
cheers,
--titus
posted at: 13:03 | path: /sep-07 | 3 comments
Wed, 26 Sep 2007
Code Layout and Version Control
So, a few people commented on my "how to write Python code that doesn't suck" post, and I thought I'd respond here rather than in the comments.
First, John Camara suggests adding the MIT license as an option. I chose the BSD because it's essentially equivalent to the MIT license, except for the BSD's no-endorsement clause, which I think is pretty reasonable; read more here.
Next, John Dawson asks,
In the article, you said: "Note that you can always organize your actual files in as deep a hierarchy as you want, while keeping the public API shallow and easy to use."
How is this technique best accomplished?
I cover this in a bit more detail in the Advanced SWC section on packages, but the essential idea is that you import as many objects/functions/classes as you can in the top-level package's __init__.py. This makes them available as toplevel.symbol as opposed to toplevel.lower1.lower2.symbol, even if they actually reside in toplevel/lower1/lower2.py::symbol.
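To make that concrete, here is a minimal sketch rather than a recipe. It writes a throwaway package to a temporary directory, using the same placeholder names as above (toplevel, lower1, lower2, symbol); the helper functions build_demo_package and load_demo_package are made up for this illustration. The only line that matters is the relative import written into the top-level __init__.py.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def build_demo_package(root: Path) -> None:
    """Write toplevel/lower1/lower2.py and the two __init__.py files under root."""
    deep = root / "toplevel" / "lower1"
    deep.mkdir(parents=True)
    (deep / "__init__.py").write_text("")
    (deep / "lower2.py").write_text(
        "def symbol():\n"
        "    return 'hello from lower2'\n"
    )
    # The key line: the top-level __init__.py re-exports the deeply nested name.
    (root / "toplevel" / "__init__.py").write_text(
        "from .lower1.lower2 import symbol\n"
    )


def load_demo_package():
    """Build the demo package in a temp dir, put it on sys.path, and import it."""
    root = Path(tempfile.mkdtemp())
    build_demo_package(root)
    sys.path.insert(0, str(root))
    importlib.invalidate_caches()
    return importlib.import_module("toplevel")


toplevel = load_demo_package()

# Users of the package only need the shallow name...
print(toplevel.symbol())

# ...but it is the very same object that lives in the deep module.
from toplevel.lower1.lower2 import symbol as deep_symbol
print(toplevel.symbol is deep_symbol)
```

The files can sit as deep in the hierarchy as you like; anyone using the package only ever has to type toplevel.symbol.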
One counterargument to this approach is that you may not want to import "optional" sub-packages, i.e. packages that may not be used by everyone using your top-level package, as a matter of performance. I contend that (except in extreme cases) this is an issue of usability over performance, and I tend to weight those 80/20 (that is, usability is more important than performance, in general). So I choose better layouts over performance improvements.
Are there other reasons to keep symbols confined below the top level? I understand from Baiju's post that the Zope community has done so simply to manage the proliferation of names, which seems sensible. I like the idea, but until Baiju's post it always confused me a bit; maybe others haven't been confused and it's just me.
Finally, Gael Varoquaux asked me (in private e-mail) about version control. He was hoping that I had some killer text that would convince people to use version control, instead of (for example) having every file in the directory tagged by date.
Unfortunately, I don't have any really convincing text. I feel that I'm most convincing when I'm talking about the need for testing, although that's apparently so much less obvious than version control that my arguments still don't work well ;). (Yes, it's counterintuitive to me, but every serious programmer I know uses version control, while many of them don't write any automated tests at all!)
Moreover, I think -- at least in academia -- that switching approaches often comes down to a matter of ego. People don't want to change, because they think they know better than you. (This is as opposed to the somewhat more understandable reason of them being unwilling to spend the time to learn the new technique.) As you can probably tell from my frustration in the first and second drafts of the original article, I am in favor of submerging your own ego regarding code appearance; the same is true of tool use. I strongly believe that individual programmers usually don't know best and that they should be open to new approaches; the rare times when I listen to other people are when I make the most headway on my big problems! Overcoming ego demands a certain delicacy of approach.
What to say, then, in favor of version control? Forget the arguments that will appeal to experienced programmers - we're talking about people newly off the boat, so to speak. What we need are arguments that sidestep ego as much as possible while pushing the rewards of putting in the necessary effort to learn a VCS.
My arguments would be:
Using version control will let you figure out what has changed between two versions. This comes in particularly handy when you need to track down bugs, or figure out precisely what changes you've made since your last release.
Yes, you can keep dated copies of individual files, or even your entire archive, but version control will do this more quickly and easily, and moreover provide a better interface for querying them.
Using version control through a public site like SourceForge or Google Code gets your code out there in the search engines and encourages people to collaborate with you. It's also a good way to build your resume: for example, I'm unlikely to consider hiring someone who has no verifiable open source project experience.
Using version control is a great way to quickly and automatically back up your code. Since most VCS explicitly support remote repositories, off-site backups are built into the tool! You can even have Google or SourceForge back up your code for you, which is pretty dang convenient.
When you do get collaborators, version control is going to be necessary. Be optimistic -- plan ahead!
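To give newcomers a feel for the first argument, here is a small sketch of the kind of "what changed?" answer a VCS hands you for free. It is not a VCS: it only uses Python's standard difflib module on two in-memory versions of a made-up file (stats.py and its contents are invented for this illustration). A real tool produces the same style of output for every file and every pair of revisions, without any dated copies lying around.

```python
import difflib

# Two hypothetical versions of the same file: the last release and the working copy.
old_version = '''def mean(values):
    return sum(values) / len(values)
'''.splitlines(keepends=True)

new_version = '''def mean(values):
    if not values:
        raise ValueError("mean() of empty sequence")
    return sum(values) / len(values)
'''.splitlines(keepends=True)

# unified_diff yields the familiar ---/+++/@@ format used by most VCS tools.
diff = list(
    difflib.unified_diff(
        old_version,
        new_version,
        fromfile="stats.py (last release)",
        tofile="stats.py (working copy)",
    )
)
print("".join(diff))

# Lines starting with a single "+" were added; a single "-" means removed.
added = [line for line in diff if line.startswith("+") and not line.startswith("+++")]
removed = [line for line in diff if line.startswith("-") and not line.startswith("---")]
print(len(added), "lines added,", len(removed), "lines removed")
```

When you're hunting a bug that appeared since the last release, output like this is usually the first thing you want to look at.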
Comments welcome, either privately or on this post.
thanks,
--titus
posted at: 11:26 | path: /sep-07 | 5 comments