I just finished attending a 1-day workshop on Cyberinfrastructure for
Marine 'Omics down in DC. It was a meeting organized by the Gordon
and Betty Moore Foundation but attended by program managers from about
a dozen different agencies and divisions (NSF BIO, NSF GEO, etc.); a
bunch of pretty serious marine biogeochemists (?) also attended. And
The workshop was both really interesting and thought-provoking, and
very depressing. The problems are hard, and the solutions aren't
An imperfect summary and set of thoughts, below.
Building "Cyberinfrastructure" for specific domains is hard
The term "cyberinfrastructure" encompasses hardware, networking, and
software across all levels - the metaphor used in the workshop was
"roads and cars", as in, without roads, cars aren't that useful;
without cars, roads aren't that useful. We need both.
Personally, I'm mostly focused on the question of "how do we build
useful software, fairly close to the research, that can help enable
domain experts to get their work done?"
The three approaches seem to be (warning, mild sarcasm below):
- give money to computational folk, who then build low level
infrastructure and focus on premature optimization of specific
workflows. End result: infrastructure that doesn't address
the actual problem.
- give money to domain scientists, who then use it to do research.
End result: software that does awesome stuff, but has unknown
generalizability and often can't be used by anyone else
(see: The Ladder of Scientific Software Notsuck)
- recognize that there are cross-domain-cutting problems that could
be addressed by coordinated funding, and try to leverage that
funding to actually address those cross-cutting problems. (I would
argue that metagenome assembly is one such problem shared by many
For examples of the 3rd, see iPlant and DOE KBase. My big problem with these is that they are not
run as open source projects so there's a lot of opaque development
focus, opaque money flow, and potentially wasted development. iPlant
(at least) has an open source mandate although I am
withholding judgment on reusability until they pay me a lot of money
to consult. Haha, no, seriously, I want to see someone else install
their stuff in the Amazon cloud. Then I'll believe them when they say
it's reusable ("trust, but verify").
A fourth way would be to embrace the Open Source/Open Platform Way and
start building kick-ass reusable components that could then be
combined into analysis pipelines by whoever wants to do so. As I
said during my talk, if the
biggest problem we have in 3 years is how to combine all the awesome
tools that are available, I will be a happy man...
This is also the point where companies could step in and make use of
the components (thus avoiding the CLC Workbench problem of "everything
is secret!") to build integrated pipelines. I can name a few companies
doing this in other areas.
This 4th way is what we would like to do with khmer. Hell, it's what we are doing
with khmer: one of the best things for me personally to see at the conference
was the public acknowledgment by Susan Gregurick of KBase that they
were using khmer somewhere in their metagenome assembly pipeline.
We need a new way to fund research software
Another thing I took away from the workshop was that we simply need a
new way to fund research software. As with my Dear Abby
post on hiring computational biologists, the domain specialists will
win in any reasonable funding competition, and the computational
people will be kicked to the curb.
Is it really that dire? I think so.
First, the funding cycles are slow and not particularly results-driven.
It may take two tries to get a 3 year grant, and at the end of the
grant everyone will want to see a few papers. The question of whether
or not you actually built anything useful rarely comes up - it's all
measured by pubs.
Second, everyone wants to fund science, not software. The fact
that some science requires software does not escape funders, but if
you give an ocean geochemist reviewer a choice between a potentially
really cool set of experiments and a potentially really cool software
platform, I bet they'll choose the experiments 9 times out of 10.
Third, funding is flat or negative overall, so fairly frequently the
overall size of the pie has effectively decreased and the reviewers
are going to have to decide which of 5 awesome projects to cut. In
combination with #2, you can bet that new software stuff isn't going
to be funded. (Yes, given that increasingly science requires
software, this bodes ill for the future.)
Fourth, software is (in theory) too broadly useful, and everyone wants
to leverage other people's funding. Why would we fund 'omics
platforms for biogeochemistry from NSF Geo, when we need basically the
same thing over in NSF Bio? They should fund that! Bounce that
around enough and voila, nobody actually gets any initial funding to
Fifth, funding for building software is rarely available:
algorithms, yes. Data structures, yes. But implementations? Yawn.
(This is for a pretty good reason: scientists basically don't know
how to implement, either, and the salaries we can offer don't bring
in serious software engineers.) The discussion of the three-track
ABI program by Peter McCartney was awesome in this regard -- they
seem to be doing it at least partly "right".
Sixth, we lack senior people who know how to build software, which is
one reason it's not getting built. (I'm increasingly feeling like the
token "guy who knows about github" in these CI meetings :). I guess
that's a good change from 5 years ago when there wouldn't have been
anyone, but the fact remains that you need people who are both domain
scientists and programmers in the room for these conversations.)
So I'm not feeling particularly positive here.
Are there any solutions?
Dunno. Everyone seems enamored of the "let's throw more money at it,
oh, wait, we have no more money, drat" conversation, and a strong
recommendation of the workshop was to have another workshop to address
whether more workshops about having workshops would be useful, with
the end goal of having another workshop to produce a white paper that
would inform the next set of workshops. (You think I'm kidding, but
only slightly -- this wasn't actually the final recommendation,
because we recognized the absurdity. And, by the way, I'm getting
burnt out on workshops.) Best part of the conversation: when one of
the organizers said, "the point of this workshop isn't just to update
the recommendations from the 2007, 2009, and 2011 workshops. It's to
come up with a specific recommendation for the next workshop." heh.
There was a lot of discussion about how to get this on the NSF's
radar by holding community meetings. But, as one experienced
program manager pointed out, you simply end up robbing Peter
to pay Paul unless new money comes in. It's not clear anyone
will actually go for that, although I'm all for trying.
I think there are a lot of cheap things that could be tried by the
more experimentally minded, though.
I'd love to see a Sandpit in the area of "meta-omics". Sandpits are brainstorming
sessions to develop grant ideas, and I think one on components would
be really welcome. The more general idea is Coopetition -- get a bunch of
smart people in a room and let them figure out what should be done,
and then fund it. As long as the smart people don't actually like
each other that much, you can avoid collusion ;).
Another thought is that if the Moore Foundation really wants to
address some of the missing components, they need to fund
bioinformaticians. I know that some of the MMI investigators are
having a hard time with bioinformatics (I partly know this because I
am collaborating with some of them), and I'm increasingly critical of
the very concept of funding data gathering without concomitantly
funding data analysis. (Then again, this recognition is one reason
the workshop was being held, so I'm probably being too mean. But they
did reject me for MMI :) And yes, this is a self-serving suggestion,
but it needs to be said.
I would particularly welcome "collaboration grants" where funding
agencies provide, say, a computational grad student's salary for some
period of time to work with an existing Moore investigator, so that
they can work on building more sustainable infrastructure centered on
a very specific biological problem, i.e. take the specific and make
it more general.
Another idea -- it would be great to have rapidly proposed, rapidly
funded, and rapidly evaluated 6- or 12-month software projects. I'm
not sure if this could work well in isolation, but if you did this as
an open call to build and test components that both iPlant and KBase
could use (for example - maybe add Galaxy in here, too), then you'd
virtually be guaranteed to end up with something useful. My guess is
the money wouldn't be big enough to actually put in the required time,
but there's an obvious solution to that ;). And if you required that
the components be open source, well, at least you'd end up with some
Ginger Armbrust talked about a meeting that sounded like a great
hackfest-like workshop, where biologists and computational people got
together to work on problems. We did something similar for the HMP,
nucleated by Rob Knight and the NIH. More of this kind of communal
workshop would be great, because it makes the problems clearer on both
sides. (Sprints FTW.)
Building software at the interface of research and computation is
really hard, no doubt about it. People are leery of pouring more
money into what has so often been a failed enterprise. End Comment.
It's no surprise that resolving the tangle of academia, funding,
career incentives, and training in order to build software effectively
is seemingly intractable. When I put up my "other things I'm doing"
slide (see my talk, slide 7)
it struck me that I'm trying to address exactly this tangle. Maybe
obvious to people, but there you are. It's nice to find unifying
themes to one's work.
It was really nice to hear an NSF program manager complaining that
they don't like seeing fixed-term faculty on research grants, because
to the NSF it seems like the university isn't putting itself on the
line at all. Basically, if you hire somebody conditionally on them
finding grants to pay for themselves and all their resources, then
the NSF rightly has no faith that the university will support them
in any way if they run into trouble. I'd never thought about it
Program managers and funders really don't like hearing themselves
described as making top-down design decisions about computational
pipelines. I said something like that early on and then had to defend
it (privately) to about 5 different people. They did see my point,
though, when I explained it like this: if you have enough money for
only one project, and you write the RFP, find the reviewers, evaluate
the reviews, and pick one specific project, it's hard to argue that
this is a "bottom up" driven process. Their point is also well taken:
"top down" depends on where you are in the hierarchy of decision
making. I just happen to be at the bottom, which means that
everything is "top down" from my perspective :)
I asked for examples of successful CI, and people cited the Protein
Data Bank and
one other that I've now forgotten.
All in all, an interesting workshop. We'll see what happens. I look
forward to the next one :).
p.s. Update: one reader suggests that we not ask people what they
Money quote, "If I had asked people what they wanted, they would have
said faster horses." -- from Henry Ford.