If you haven't seen mybinder.org, you should
go check it out. It's a site that runs IPython/Jupyter Notebooks from
GitHub for free, and I think it's a solution to the problem of
publishing reproducible computational work.
For a really basic example, take a look at my demo Software
Carpentry lesson. Clicking
on that link brings up a Python notebook that will load and plot some
data; it's interactive and you can go ahead and execute the cells.
There are a couple of really appealing things about this.
First, it's simple -- the source repository is just a
directory that contains a Jupyter notebook and some data. It's pretty
much what you'd have if you were actually, you know, doing data
analysis with data in a repo.
Second, it's built on top of solid technology. mybinder's
foundation is Docker and Jupyter Notebook, both of which are robust and have large,
supportive communities. Sure, behind the scenes, mybinder is using
some bleeding edge tech (Kubernetes
running on Google Compute Engine) to provision and load balance
the compute. But I bet it would take me no more than 30 minutes to
cobble together something that would let me run any
mybinder-compatible repository on my laptop, or on any compute server
that can run Docker.
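To make that concrete, here's a rough sketch of what that cobbled-together local runner might look like. The repository URL is a placeholder, and the `run` helper just prints each command rather than executing it -- swap in `run() { "$@"; }` to actually run things (which requires git and Docker).

```shell
#!/bin/sh
# Sketch: run a mybinder-compatible repository on a laptop.
# USER/REPO is a placeholder -- substitute any repository whose
# Dockerfile derives from andrewosh/binder-base.
# `run` prints commands instead of executing them.
run() { echo "+ $*"; }

run git clone https://github.com/USER/REPO repo
run docker build -t local-binder repo
run docker run -p 8888:8888 local-binder \
    jupyter notebook --ip=0.0.0.0 --no-browser
```

The point isn't that these exact flags are right -- it's that the whole contract is "a repo with a Dockerfile", so any Docker host can play the mybinder role.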
Third, it's super configurable -- because it's based on Docker, you
can install pretty much anything you want. I tested this out myself
by installing R and an R kernel for the Jupyter notebook -- see this
mybinder link and repo. The Dockerfile
that installs things is pretty straightforward (given that installing
all that stuff isn't straightforward in the first place ;).
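As a sketch of what such a Dockerfile might look like -- package names and install commands here are illustrative, not copied from my repo:

```dockerfile
# Sketch only: derive from the binder base image, then add R and an
# R kernel for Jupyter. Exact package names/versions are illustrative.
FROM andrewosh/binder-base

USER root
RUN apt-get update && apt-get install -y r-base r-base-dev

USER main
# Install IRkernel and register it so Jupyter can run R notebooks.
RUN R -e "install.packages('IRkernel', repos='https://cran.r-project.org')" && \
    R -e "IRkernel::installspec()"
```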
Fourth, it's open source.
Go ahead, do what you want with it!
I believe mybinder is the hitherto missing link for publishing reproducible
computation. Before this, we could give people our data and our data
analysis scripts, but not our compute environment - or at least not in
a way that was easy to run. We can now provide readers with a link
that takes them straight from a paper to a running instantiation of
that paper's data analysis.
What's missing and easy?
Authentication and Cost: right now, mybinder is running on
someone else's dime -- specifically, the Freeman Lab's dime. This can't continue
forever, especially since it's totes open for abuse. I expect that
sooner or later someone will gin up a way to run this on many cloud
compute services; once that works, it should be pretty easy to
build a site that redirects you to your favorite compute service
and lets you enter your own credentials, then boom there you go.
(I think current mybinder limitations are in the several GB of RAM
range, with a few dozen MB of disk space.)
Executing anything other than GitHub public repos: mybinder only
runs off of public GitHub repositories right now, but that's not going
to be too tricky to fix, either. BitBucket, GitLab, and private
repos could be supported almost as easily.
RStudio and RMarkdown: currently, it's all Project Jupyter, but
it should be pretty easy to get RStudio Server or some sort of
RMarkdown application (Shiny?) working on it.
Pre-built Docker images: you need to derive your Dockerfile from
the andrewosh/binder-base image, which means you can't use
pre-built images from the Docker Hub (even if they were built from
binder-base). I predict that will go away soon enough, since it's not
a fundamental requirement of the way they're doing things.
What's missing and not easy?
Big files and big/long compute: Coordinating the delivery of big
files, big compute, and big memory (potentially for a long period of
time) is something that's out of easy reach, I think. Docker isn't
ready for really data-intensive work, from what I can tell. Moreover,
I'm not sure when (if ever) people will completely automate the
heavyweight data analysis - it's surely further off than the
now-routine use of Jupyter or RStudio for interactive analysis.
The split that my lab has made here is to use a workflow engine
(e.g. make, pydoit, or snakemake) for the compute & data intensive
stuff, and then feed those intermediate results (assembly and mapping
stats, quantification, etc.) into analysis notebooks. For mybinder
purposes, there should be no problem saving those intermediate results
into a github repo for us and everyone else to analyze and reanalyze.
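A minimal sketch of that split, with entirely hypothetical file and command names, might look like this in make:

```make
# Sketch: heavyweight steps produce small intermediate files that get
# committed to the repo; the notebook only reads the intermediates.
# All file names and commands here are hypothetical.

all: summary/mapping-stats.csv

# Big compute: run the mapper on raw reads (output NOT committed to git).
mapping/aligned.bam: reads.fq reference.fa
	mapper reference.fa reads.fq > $@

# Small output: summarize into a CSV that IS committed, so the
# mybinder notebook can load it directly.
summary/mapping-stats.csv: mapping/aligned.bam
	summarize-mapping $< > $@
```

The design choice is that only the cheap-to-store, cheap-to-load summaries cross the boundary into the notebook, so the mybinder side never needs the big files or the big compute.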
What are the next steps?
I bet this will evolve pretty fast, because it's an easy way for
scientists to deliver compute to friends, colleagues, and students.
In terms of learning how to make use of mybinder, Software Carpentry already teaches scientists how to
do much of the necessary stuff -- scripting, git, and Jupyter
Notebooks are all part of the standard curriculum. It's not a big
leap to show students some simple Docker files. There's virtually no
unnecessary configuration required at the moment, which is a big win;
we can talk about why everything is actually important for the end
result, which is <ahem> not always the case in computing.
Beyond that, I'm not sure where the gaps are - as usual, we're missing
incentives for adopting this, so I'd be interested in thinking about
how to structure things there. But I look forward to seeing how this
evolves!

p.s. I'm aware of at least one other project that does something
similar to mybinder; I just haven't had the chance to try it out, and
I'd love to hear from people who have used it.
p.p.s. For a real, and super cool, example of mybinder, see Min
Ragan-Kelley's LIGO/gravitational waves tutorial.