A week or two ago, I posted a crazy idea about crowdsourcing a bioinformatics analysis pipeline. I may still try to do that. But in the meantime, here's another crazy idea.
First, some background.
People are hungry for hard and interesting problems
I'm writing this as I fly back from PyCon 2013, where I hung out with a lot of people who seemed to be looking for challenging problems to tackle. I trace this back to a general discontent with solving problems like increasing the click rate on ads by 0.1%, as well as the realization that once you are comfortable and settled and making enough money, it's easy to get bored.
(Side note: we gave @hugs the TiP BoF award for a presentation on how he's resolving this conflict for himself by teaming up with @dabeaz to work on Maker stuff.)
I personally think a lot of these people should look into doing research, but there are a few problems with that. One is that most PhD programs -- which I believe are by far the best route to learning to do research -- have a residential requirement, require ~100% of your attention, and don't pay that well. The second is that a lot of PhD research is pretty abstract and appears irrelevant to the real world, and it's hard to pick projects when you're not really familiar with the area (a problem that new grad students have, too). A third is that research virtually requires a close working relationship with a small group of people, including someone who knows the lay of the land around the topic of interest and can serve as a guide.
More broadly, very few of these people even want a PhD. While I think the problems with residential PhD programs could be solved, I can't imagine that very many well-paid Python hackers want to quit their job to get a PhD. I'm skeptical that many companies would want to pay for or donate their developers' time for 20-40 hours a week to do a real PhD, too.
Is there a solution?
The Crazy Idea
I have a few really hard data analysis problems to solve. One of them, infinite metagenome assembly, is reasonably nicely encapsulated and bounded, and presents some interesting parallelism and data structure challenges. Moreover, we (me, my lab, and my collaborators) have lots of sample data and understand the basic problem pretty well. I submitted a grant on it, but in part because of straight up research funding challenges, and also because of the sequester, I'm unlikely to get the grant. But we still need to solve the problem. I've been thinking of working on it myself, but I'm increasingly unable to put in the time.
What if I crowdsourced the problem?
More specifically, what if I issued an open call for people to come "play" with us on this problem? I could set up a mailing list and an editable community doc project, as well as a repo with sample data sets and scripts and code (this is already available but not necessarily easy to work with). The goal would be to come up with a more detailed understanding of the problem, whatever novel data structures and algorithms were needed, an implementation at scale, and a practical engineering solution with empirically shown "good" scaling behavior.
A statement of the problem
Imagine feeding the world's libraries into a paper shredder, mixing the output thoroughly, and then digitizing the textual contents of the shreds into strings (note that the process has errors, so each character will have an associated likelihood that that individual character is correct).
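To make the "associated likelihood" part concrete, here's a minimal sketch of one way to represent it. I'm assuming the Phred convention (Q = -10 * log10 of the error probability) with the common offset-33 text encoding; the function names and the toy quality string are made up for illustration, not taken from our code.

```python
def phred_to_p_correct(q):
    """Convert a Phred quality score into the probability the character is correct."""
    return 1.0 - 10 ** (-q / 10)


def decode_quality(quality_string, offset=33):
    """Decode one quality character per base (offset-33 encoding is an assumption)."""
    return [phred_to_p_correct(ord(c) - offset) for c in quality_string]


def expected_errors(p_correct):
    """Expected number of wrong characters in a shred, given per-character confidence."""
    return sum(1.0 - p for p in p_correct)


# Hypothetical shred: four characters with Phred scores 40, 40, 20, 10.
shred = "ACGT"
confidence = decode_quality("II5+")
for base, p in zip(shred, confidence):
    print(base, round(p, 4))
print("expected errors:", round(expected_errors(confidence), 4))
```

The point is only that every string arrives with a parallel list of confidences, and any real solution has to decide what to do with the low-confidence characters.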
Now, reconstruct the source books by assembling the books "de novo", just from the content of the strings. Note a few important features: most books will be present in multiple copies (so you can rely on overlaps between the randomly fragmented strings); some books will be more abundant than others (so you can separate out the string fragments by abundance, potentially); and there will be different editions of books, so you may have that complication (which may be impossible to resolve 100% correctly). Note also that this uses a restricted alphabet (A/C/G/T).
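Here's a toy sketch of the "separate by abundance" idea, assuming we summarize each fragment by the median count of its length-k substrings (k-mers). This is an illustration only: it ignores errors, editions, and reverse complements, and the "books" and fragments are invented.

```python
from collections import Counter
from statistics import median


def kmers(read, k):
    """Yield every overlapping substring of length k in a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def count_kmers(reads, k):
    """Count how often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def read_abundance(read, counts, k):
    """Summarize a read by the median count of its k-mers."""
    return median(counts[km] for km in kmers(read, k))


K = 4
reads = [
    "ACGTTGCAAG",    # three overlapping shreds of an abundant "book"
    "GTTGCAAGGC",
    "TTGCAAGGCT",
    "TCCGATATCGGA",  # a single shred of a rare "book"
]
counts = count_kmers(reads, K)
for read in reads:
    print(read, read_abundance(read, counts, K))
```

Shreds from the abundant book share k-mers with each other, so their median counts come out higher than the lone shred from the rare one. The catch is that this version keeps every k-mer in memory, which is exactly what stops working at scale.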
Implement an online streaming solution to this problem that can handle arbitrarily large amounts of data.
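To give a flavor of what "online streaming" could mean here, this is a minimal sketch of counting k-mers in one pass with fixed memory, using a Count-Min sketch. To be clear, this is just one well-known approximate counting technique I'm using as an example, not a claimed solution; the class, the parameters, and the toy stream are all assumptions.

```python
import hashlib


class CountMinSketch:
    """Fixed-memory approximate counter: may overestimate, never underestimates."""

    def __init__(self, width=1 << 16, depth=4):
        self.width = width
        self.depth = depth
        self.tables = [[0] * width for _ in range(depth)]

    def _slots(self, item):
        # One independent hash per row, obtained by salting BLAKE2b with the row index.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=bytes([row])
            ).digest()
            yield row, int.from_bytes(digest, "big") % self.width

    def add(self, item):
        for row, col in self._slots(item):
            self.tables[row][col] += 1

    def estimate(self, item):
        return min(self.tables[row][col] for row, col in self._slots(item))


def stream_kmer_counts(read_stream, k, sketch):
    """Consume reads one at a time; memory use does not grow with the input."""
    for read in read_stream:
        for i in range(len(read) - k + 1):
            sketch.add(read[i:i + k])
    return sketch


# A generator stands in for an arbitrarily long stream of shreds.
stream = (r for r in ["ACGTTGCAAG", "GTTGCAAGGC", "TTGCAAGGCT"])
sketch = stream_kmer_counts(stream, 4, CountMinSketch())
print(sketch.estimate("TTGC"))
```

Memory is set by `width` and `depth` rather than by the amount of data, at the cost of occasionally overcounting when k-mers collide. Counting is the easy part, of course; doing the actual assembly in a streaming fashion is the hard problem.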
How would this work?
The goal would be to work collectively towards a solution that was well characterized, well implemented, and effective. I'd expect this to take 1-2 years (but, if you think you're super smart, prove me wrong!!)
We would build out a basic documentation site with a statement of the problem, some tutorials to make use of existing tools, and some test data sets, all on github. We'd also set up a mailing list of some sort, and I'd commit to real interaction on the mailing list.
People who wanted to participate in the project would join the list and bash on problems cooperatively or coopetitively. Technical discussion would occur on the list, and I would encourage people to post summaries of on-list discussions and detailed potential solutions or thoughts on their own blogs, and then link them back to the documentation Web site. We could describe failures in gross detail, to make sure that people knew which paths had been explored.
Everyone would be welcome to work on the problem in isolation, too, and since the data sets are all public and there are no IP or licensing problems with our current products, anyone could publish or use the materials. I'd probably write up a quick paper describing the challenge and post it on arXiv, where it could be cited; the only requirement I would have is that you cite that paper when publishing.
People who worked closely with me and my group would, at the least, be co-authors on any publication that arose, and the git repo history and mailing list would serve as a history of intellectual and technical provenance.
Compute resources would be DIY, although for people who got involved enough I could supply access to more significant compute resources as needed. We could probably inveigle cloud providers to contribute resources for benchmarking and reputational purposes, too.
We'd probably need some mailing list rules to avoid trolls and people who wanted to argue about the problem rather than work on the problem, but I think slightly modified mailing list rules together with a self-confident project BDFL (yo!) would suffice.
What are the incentives?
Why would anyone participate?
Non-academic participants would get:
- the fun of solving a hard problem;
- contributing to an advance in human knowledge, however small;
- major props in a tiny, irrelevant community;
- co-authorship (see above);
- training in biology lingo;
- an entree to a whole world of other interesting problems.
Academic participants would get basically the same thing, although presumably co-authorship would be a tad more relevant to them.
What do I get?
- a reputational boost for trying something cool and being open about it;
- potentially a solution to a really annoying problem that is standing in the way of my collaborators' research;
- another publication;
- interaction with a bunch of potentially really cool people;
- did I mention a solution?
Is there more in it for me? Not really -- I don't get paid based on problems solved, so the most I would get is the reputational boost and the publication. IP and code licensing for anyone participating in the core cooperative project would need to be open, so I can't swipe your cool solution and sell it.
Other misc points
Depending on how things worked out, I can see this solving lots of subproblems in genome, transcriptome, and metatranscriptome analysis.
I'd propose minimizing code churn for cleanup's sake -- our code isn't that awesome necessarily, but I would want to see people working on the problem itself rather than on the epiphenomena.
All our code is written in Python and C++. You can work in (e.g.) Ruby if you want, but you're going to have to rewrite a lot.
I'd welcome other research groups, provided they cite the paper in anything that comes out of it, and, if I/we contribute intellectually, mark it as a collaboration under the usual rules. In other words, the mailing list isn't a free source of publishable ideas -- you have to (gasp!) put the names of people on it. But honestly I'd rather see a solution than my name on a paper ;).
Victory and max props can only be claimed with a real, implemented solution. We appreciate that you have special insight into a simple theoretical solution, and want to leave the implementation details to the software engineering plebes, but no, I'm not interested, thankyouverymuch.
Is this a novel approach to doing research?
I don't think it is, particularly, although I think trying to involve bored software engineers in a real research problem is not that common. I'm certainly aware of the Polymath Project and Zooniverse projects, which have been really effective at crowdsourcing research problems. More generally, I really enjoyed Michael Nielsen's book, Reinventing Discovery: The New Era of Networked Science, which is about this kind of stuff.
If there's anything novel about this, it's that I think it's a nicely packaged problem that can be explored by people with computing intuition but not a huge amount of biology-specific background. But who cares? It's a problem I think needs to be solved.
I guess the first question is, should I bother? I could throw a project and have nobody come. It's a reasonably big chunk of time and effort to set this kind of thing up, and I have no idea if there'd be any real enthusiasm out there. On the flip side, it'd be a useful academic adventure anyway, since we generally lack good (hard, nasty) benchmarking data sets for metagenome assembly and analysis, and I can definitely provide some.
Another question is, is anyone out there willing to really work on such a hard problem? In fact, I'd bet it's too hard for anyone but an academic researcher to solve... :)
I'll sit on it for a bit and think about it some more. Any suggestions or comments for making it more attractive as a project, or more interesting, would be welcome.