I just left the NAS meeting on Integrating Environmental Health Data
to Advance Discovery,
where I was an invited speaker. It was a pretty interesting meeting,
with presentations from speakers who worked on chemotoxicity data,
pollution data, exposure data, and electronic health records, as well
as a few "outsiders" from non-environmental health fields who dealt
with big data and data integration issues. I was one of the outsiders.
When Carolyn Mattingly spoke with me about coming to this meeting, I
was more than a bit hesitant, because (a) it's not my field, (b) I
mostly work with lots of rather homogeneous sequencing data so am not
an expert in heterogeneous data, (c) the
integration front is a bit of a horror show in bioinformatics, and (d)
I didn't think many people, anywhere, had done a good job of setting
up structures for solving this problem. So I wasn't really sure what
I could say. Carolyn nonetheless convinced me that I might have
something to add (the paraphrase I remember is "you admitted ignorance
more quickly than most, which seems valuable") and bulldozed me into
coming.
(I'm glad I attended!)
I took about a month to prepare, and I ended up deciding to talk about
what kinds of software architecture you might want to invest in for
the long term. I rephrased my technical goal for the meeting like so:
How can we best provide a platform or platforms to support flexible
data integration and data investigation across a wide range of data
sets and data types in biology?
with the following specific goals:
- Avoid master data manager and centralization
- Support federated roll-out of new data and functionality
- Provide flexible extensibility of ontologies and hierarchies
- Support diverse "ecology" of databases, interfaces, and analysis software.
In response to these goals, I looked for success stories outside of
biology, and in particular for domains and entities that
- have really large amounts of heterogeneous data,
- are continually increasing in size,
- are being effectively mined on an ongoing basis,
- have widely used programmatic interfaces that support "mashups" and other cross-database stuff,
- and are intentional, with principles that we can steal or adapt,
and found Amazon [0]. I was already aware of Amazon's approach from
Steve Yegge's Platform Rant,
and I found a summary of Amazon's architecture (based on interviews
and such) at HighScalability. They have been
demonstrably successful at managing 50 million+ customers, billions of
reviews, hundreds of software services, and basically complexity that
is unimaginable from my perspective. So what I discussed was how, from
a software architecture perspective, taking this kind of platform view
was probably the right way forward.
I peppered the talk with entertaining slides, 'cause walls-o'-text
never work that well and I couldn't use my normal research figures.
I included a view of the booksthatmakeyoudumb.virgil.gr mashup, the XKCD "standards"
comic, and the tree-swing design slide. Conveniently, I was preceded by Stephen Friend (of SAGE), who
made use of often wildly irrelevant slides to hold the audience's
attention, and after me, Matt Martin (from the EPA) put up a GI Joe
slide to make the point that "knowing is half the battle"; my speaking
approach wasn't actually that out of the ordinary :).
I think the talk went OK, with only one major catastrophe (more about that
in my other blog post); a number of people claimed to find it interesting,
at any rate, which is better than being ignored and silently hated!
For the talk itself, you can see my slides here,
and everything was recorded. I'll update the post to point to the
transcript and video if and when they are made available.
As I watched the rest of the workshop, though, I realized that I had
stumbled into something deeper than a set of flippant comments that
offered a slightly different perspective on the issues at hand. I
think I may have converted myself into believing, wholeheartedly, that
the platform view is what we should be doing, and, moreover, that I
want to work on it myself, and, more moreover, that it may be the only
successful way forward for the long term.
Stripping Yegge's post of its irreverence, he claimed that Amazon's
CEO, Jeff Bezos, had mandated the following:
1. All teams must expose data and functionality solely through a
service interface (a service-oriented architecture).
2. All communication between teams happens through that service
interface (no exceptions!).
3. All service interfaces must be designed so that they can be exposed
to the outside world (externalizable).
A more succinct way of putting it is this: design your service by
dogfooding it -- design the service you want to use, but design
it for others; and then use it yourself.
The translation of this into science is that if you want to be a data
provider, you should provide data via an interface that is rich enough
for you yourself to use as a data consumer. Since, in science, most
data producers and data providers are also consumers of that data,
this would ensure that at least some of the data would be usable by
someone real. This is directly analogous to the agile software
development approach, in which you design the software with the
end-user in mind, with lots of little iterations to make sure that
you don't deviate too far from a useful course.
In contrast to a service, many (most?) people in biology today provide
data in one of two ways: via a Web GUI or direct data download. Web
GUIs usually ration the data out piecemeal, via a hard-to-automate but
easy-to-use Web site. Direct data download is the opposite end of the
spectrum: you can get straight-up file download of everything, but
making use of it in any way requires some amount of scripting or
loading and processing. When service APIs to query and probe subsets
of the data are provided, anecdotes from the DAS folk and others
suggest that people use them heavily.
Why would you want to provide service interfaces? Apart from
architectural and software engineering reasons, the main positive
reason, IMO, is to make serendipitous mashups as easy as possible.
Mashups are Web sites and services where different data sets are
"mashed" together in such a way as to provide novel functionality; for
example, books that make you dumb "mashed" together
Facebook's "top 10 books" by college lists with a list of average SAT
scores by college to figure out which books were correlated with low
and high SAT scores. There's no way that Facebook predicted that use,
but because the data for both were available, Virgil Griffith could
smush the data together to make a new service. (Note that Virgil is
also the person behind wikiscanner, which correlates changes to
Wikipedia articles by originating domain.) These kinds of mashups
are basically the same thing as what scientists want to do with
heterogeneous data sets: no one can predict every possible query,
or design their Web site to cover more than the most common use cases.
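To make the pattern concrete, here's a toy sketch in Python; the data below
is entirely made up and simply stands in for what would really come from two
different services. A mashup is often little more than a join on a shared key
across data sets that were published independently:

```python
# Toy illustration of the mashup pattern: join two independently published
# data sets on a shared key (here, a college name). The literals are
# invented placeholders for data that would really come from two services.
top_book_by_college = {
    "Hypothetical Tech": "A Dense Math Classic",
    "Example State": "A Popular Thriller",
}
average_sat_by_college = {
    "Hypothetical Tech": 1520,
    "Example State": 1050,
}

# Neither "provider" anticipated this combination; all the join needs is a
# common key and programmatic access to both data sets.
for college, book in top_book_by_college.items():
    sat = average_sat_by_college.get(college)
    if sat is not None:
        print("{0}: average SAT {1} ({2})".format(book, sat, college))
```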
Why do we need service interfaces? Service interfaces are basically
remote procedure calls that let you query, manipulate, subset, and
download information and data from remote Web sites. You can think of
them as libraries of remote functions and data. In contrast to
service interfaces, Web GUIs tend to channel queries in very specific
directions, and there's a lot of user experience, security, and
authentication involved in building them. Data downloads go the other
way: to make the data useful, you often need to load in multiple data
files and do your own correlation and querying. At their worst, service
interfaces can either be as specific as Web GUIs (but without the nice
interface), or as difficult as data downloads ("here's all my data;
have fun") but with higher latency. But I bet that for most data sets,
there is a broad range of situations where a service interface would
both be considerably more flexible than an HTML Web site
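Here's a rough sketch of what consuming such a service interface could look
like from Python; the base URL, endpoint, and parameter names are invented
for illustration and don't correspond to any real service.

```python
# Sketch of a client hitting a hypothetical service interface: ask the
# remote service for just the subset we care about, rather than downloading
# the entire data set and filtering it locally. The URL and parameters are
# made up for illustration.
import requests

BASE_URL = "https://data.example.org/api"

response = requests.get(
    BASE_URL + "/measurements",
    params={"chemical": "atrazine", "region": "midwest", "format": "json"},
)
response.raise_for_status()

for record in response.json():
    print(record["site"], record["concentration"])
```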
The idea of providing service interfaces is not new by any means.
NCBI and ENSEMBL actually already provide them; their Web sites are,
in some ways, a thin layer over a robust and complex service API.
Plenty of toolkits make use of them. I just think that, in an era
of large amounts of wildly heterogeneous data, everybody should
be doing things this way.
The more basic point I want to make is this: Amazon and other
companies have to deal with similar (or larger) issues of data size,
data integration, data access, trust, and security as do scientists.
Amazon has settled on a platform architecture, or so it seems; they're
clearly very successful, and roll out new services with a
frustratingly casual saunter; they have a set of ground rules and
experiences that can be conveyed in techie-speak; and it would be
stupid of us to ignore them. Just as computational science is
discovering open source, agile programming, version control, and
testing (among other things), I think we should be paying attention to
distributed software architectures that demonstrably can function at
the scale we need.
How practical is this vision?
Not very, in some ways. Scientists are notoriously badly trained in
software engineering, and I don't think computer science researchers,
broadly speaking, have distinguished themselves as particularly good
programmers, either. On the other hand, scientists are really good at picking
up new things (you might call it an occupational requirement ;) and
given the right motivation, I think it could work.
One path towards practical realization is to pick a loosely defined
field, be it environmental health science or genomics, and work to
establish a community of practice within that field. The rule would
be that anyone seeking new funding for large data production or
provision would have to explain how they were going to make all their
data available via a service interface; this could be provided via
supplemental funding for existing grants, too. Any group that
obtained this funding would have to hire a techie who would be
responsible for interacting with the other techies at regular
meetings. Grants would have to include use cases for their data, and
each group would be regularly evaluated on progress towards addressing
those use cases.
The other side of the coin would be data consumers, who would need
to be trained. Biologists, for example, are used to dealing with
things via a Web interface; but I think Python libraries combined with
an IPython Notebook install in the cloud would make it possible to
quickly and easily teach scientists to work with single or multiple
remote services. Workshops at domain conferences, online help desks,
remote tutoring, and all the rest could be brought to bear to help
them through that initial steep learning curve. More advanced
workshops could focus on building new services that combined data from
producers into mashups that then became new data producers, and people
who wanted to provide specific Web interfaces built on top of some or
all of it would be free to do so.
What are the advantages of this approach?
I think there are a few key advantages.
First, I think we can actually train scientists in both building and
using service APIs. Over at Software Carpentry, Greg Wilson has been
complaining that the only thing we can teach scientists to do with
their data is either build static Web sites or dynamic sites with
massive security holes. I think with relatively little effort we
could make fairly generic Web services for columnar data (think Excel
spreadsheets), and, with only a little more effort, show them how to
provide subset and query ability on that data. Integrating other data
sets would be harder, of course, but in many ways not much harder than
building their own SQL database -- which is something we already
purport to teach.
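To give a sense of how little code such a generic service might take, here's
a toy sketch using Flask (one option among many Web frameworks); the file
name, route, and query convention are placeholders, and a real service would
need paging, data typing, and authentication on top of this.

```python
# Toy sketch of a generic Web service over columnar data: load a CSV file
# and let clients filter rows by column values via query parameters.
# Everything here (file name, route, query convention) is a placeholder.
import csv

from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical spreadsheet-style data file with a header row.
with open("samples.csv", newline="") as fp:
    ROWS = list(csv.DictReader(fp))

@app.route("/rows")
def rows():
    # Treat every query parameter as a "column == value" filter.
    selected = ROWS
    for column, value in request.args.items():
        selected = [row for row in selected if row.get(column) == value]
    return jsonify(selected)

if __name__ == "__main__":
    app.run(port=5000)
```

A client could then ask for something like /rows?species=zebrafish&tissue=liver
and get back only the matching records, without ever seeing the underlying file.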
Second, cloud infrastructure makes it easy for groups to build
services that they don't have to actually host themselves; we could
provide common mechanisms for hosting for the field, but leave
scientists to implement their own software.
Third, new services and functionality can be rolled out in a federated
manner -- there is no "master data manager", no centralized site that
you need to argue with to include your data or modify their
functionality. As long as the APIs are versioned (which is key), a
tenuous stability can be achieved.
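One lightweight convention for that versioning (just an illustration, not the
only way to do it) is to put the version number in the URL, so that existing
clients keep talking to /v1/ while new or changed functionality rolls out
under /v2/:

```python
# Illustrative only: versioned routes let old clients keep working while a
# new version adds or renames fields. The field names here are made up.
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/v1/samples/<sample_id>")
def sample_v1(sample_id):
    return jsonify({"id": sample_id, "reads": 123456})

@app.route("/v2/samples/<sample_id>")
def sample_v2(sample_id):
    # v2 renames "reads" and adds metadata; v1 stays untouched.
    return jsonify({"id": sample_id, "read_count": 123456,
                    "assembly": "hypothetical-build-2"})

if __name__ == "__main__":
    app.run(port=5000)
```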
Fourth, it's language independent, thank goodness. You Java, Perl and
Ruby folk can go your misguided ways (note: I am a Pythonista). R
would make a perfectly good client, and services wrapping R could
themselves provide statistical analysis and data visualization. All
communication would work via HTTP, which I
think is fast becoming the one true language that everyone actually
has to speak!
Fifth, it's just good software practice. Modularity, separation of concerns,
easy testability, and an acknowledgement that change is inevitable, are all
things that we expect in big systems.
Sixth, enabling interoperability between services would enable a
software ecology, which is desperately needed. Rather than two
duelling databases with non-overlapping data, we could have many
duelling databases with overlapping or complementary data -- and some
of them would die off, while some of them would succeed. One might
almost call it a meritocracy, where the most useful survive. Usage
stats could be used to motivate and drive maintenance funding, while
competitive funding opportunities could be used for extension requests.
Seventh, it provides a formal distinction between the service
providers and the service consumers. One of my favorite moments
during Matt Martin's talk on providing toxicity data via a Web
service was when one of the audience members asked him if he could
provide a certain custom view on his dashboard app. An appropriate
response to this would be "no, ma'am, my job is to give you the data
necessary to do that analysis; it's your job to actually do it." With
tools like IPython Notebook and RStudio, Matt could even give remote
users a set of example views that they could then tweak, modify, and
remix to their own heart's content -- this would be far more enabling
than a straight Web interface could be.
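To give a flavor of what such a shareable example view might look like in a
notebook cell, here's a sketch; the endpoint and field names are made up, but
the pattern (pull a subset from the service, plot it locally, tweak to taste)
is the point.

```python
# Sketch of an "example view" you might hand out in a notebook: pull a
# subset of data from a (hypothetical) service and plot it. Users can change
# the query or the plot without touching the data provider's code.
import requests
import matplotlib.pyplot as plt

resp = requests.get(
    "https://toxdata.example.org/api/assays",   # made-up endpoint
    params={"chemical": "bisphenol-a", "format": "json"},
)
resp.raise_for_status()
records = resp.json()

doses = [r["dose"] for r in records]
responses = [r["response"] for r in records]

plt.scatter(doses, responses)
plt.xlabel("dose")
plt.ylabel("response")
plt.title("Example view: change the query or the plot to taste")
plt.show()
```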
What are the disadvantages of this approach?
A platform approach still requires software development -- if
anything, quite a bit more of it, by more people, than a centralized
approach would. I don't think there's a way around this, but at least if we
get everyone on the same page, we can establish a community of
practice and help each other out by finding common patterns and
problems.
We haven't solved any really hard problems with this, either.
Ontologies and standards are still needed, although they can be more
"casual agreements" than "standards", and, more importantly, they can
be based on real direct immediate use.
The main objection that will be raised, I bet, is that shipping large
amounts of data around via service APIs is inefficient and nasty. In
response I will say that I think Amazon and Google have made the case
that, past a certain scale, storing massive amounts of data in one big
database is also a really bad idea. It behooves us to figure this out
now. Even for the massive sequence collections I'm dealing with, I
think this is solvable by providing a richer or deeper set of query
functions, or perhaps just telling people that you need a really
compelling use case to justify the specific development effort needed.
How is this different from every other distributed/remote/blahblah out there?
Most other efforts of which I'm aware have focused on a single problem
or layer. "I'm going to be the world expert on genome intervals", for
example. Or "I'm going to be the flexible middleware layer that
enables generic communication." But I haven't seen many properly
separated-out vertically integrated blobs of functionality. So
perhaps the combination is different -- separated blobs of lightweight
functionality, intended to provide useful mashups of data via generic
HTTP interfaces, combined with vertically integrated examples and
demos in interactive notebooks like RStudio and IPython Notebook.
It's entirely probable that someone has done this, and I just don't
know about it. I'd love to hear about it, along with any success or failure stories.
And how would you get started, Dr. Brown?
On the off chance that someone flung a few $100k at me (anyone?
anyone? Bueller?), I would go out and find three databases that wanted
to work together and had good complex use cases, hire one developer at
each + one local to me, and get to work. First up would be to
implement some lightweight use cases to iron out the kinks and get
everyone working together. Then, I would proceed with a series of
small iterations, each time working towards addressing a new use case
or two. Initially, my developer would be in charge of building
independent libraries and third-party tests, as well as trying out
various mashups; as the collaboration proceeded, my developer would
move towards scalability and security testing, identifying common
needs and development patterns. At that point, other databases and
data sets and use cases could start to be considered; end users could
also start to be trained on the interfaces, and could come up with
their own use cases for which new functionality was needed. The
overall goals would be to provide an example set of services, as well
as a set of practices that seemed to work well for this set of use
cases, and -- perhaps most importantly -- a group of people with
experience and expertise in this way of working.
--titus
[0] Disclaimer: Amazon occasionally tosses small research grants my
way, and a family member works rather high up for 'em, as does a
student of mine. That doesn't entitle Amazon to any slack from me,
though.