The 2017 binder workshop!

tl;dr? We ran a workshop on binder. It was fun!

workshop attendee photo

What is binder?

Imagine... that you are visiting the data repository for a preprint you are reviewing, and with the click of a button you are brought to a fully configured RStudio Server containing that data.

Imagine... you are running a workshop, and you want to introduce everyone in the workshop to a machine-learning approach. You give them all the same URL, and within seconds everyone in the room is looking at their own live environment, copied from your blueprint but individually modifiable and exportable.

Imagine... your lab has a collection of standard data analysis protocols in Jupyter Notebooks on your GitHub site, and anyone in your lab can, with a single click, bring them to life and run them on a new data set.

Binder is a concept and technology that makes all of the above, and more, tantalizingly close to everyday realization! The techie version is this: currently,

  • upon Web request, binder grabs a GitHub repository, inspects it, and builds a custom Docker image based on a variety of configuration detection;

  • then, binder spins up a Docker container from that image and redirects the Web browser to the running container;

  • at some point, binder detects lack of activity and shuts down the container.
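The "configuration detection" step in the first bullet can be pictured with a small shell sketch. This is only an illustration of the idea, not how binder or repo2docker actually implement it: the function name `detect_build_strategy`, the labels it prints, and the precedence order between the files are all assumptions made up for this example.

```shell
#!/bin/sh
# Hypothetical sketch: guess how a checked-out repository might be built,
# based on which well-known configuration files it contains.
# The function name, labels, and precedence order are illustrative
# assumptions, not binder's real detection logic.

detect_build_strategy() {
    # $1 = path to a local checkout of the repository
    if [ -f "$1/Dockerfile" ]; then
        echo "dockerfile"      # repository ships its own Dockerfile
    elif [ -f "$1/environment.yml" ]; then
        echo "conda"           # conda environment specification
    elif [ -f "$1/requirements.txt" ]; then
        echo "pip"             # pip requirements list
    else
        echo "default"         # nothing recognized: use a base image
    fi
}
```

Calling `detect_build_strategy /path/to/checkout` prints one label, which a real builder would then use to decide what goes into the Docker image.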

All of this is (currently) done without authentication or payment of any kind, which makes it a truly zero configuration/single-click experience for the user.

Just as important, the binder infrastructure is meant to be widely distributed, reusable, and hackable open source tech that supports multiple deployments and customization!

The workshop!

In 2016, I wrote a proposal to the Sloan Foundation to fund a workshop on binder, and it was funded!! We finally ran the workshop last week, with the following organizing committee:

Why a workshop?

Many people, including myself, see massive potential in binder, but it is still young. The workshop was intended to explore possible technical directions for binder's evolution, build community around the binder ecosystem, and explore issues of sustainability.

One particular item that came up early on in the workshop was that there are many possible integration points for binder into current data and compute infrastructure providers. That's great! But, in the long term, we also need to plan for the current set of endeavors failing or evolving, so we should be building a community around the core binder concepts and developing de facto standards and practices. This will allow us to evolve with those endeavors as well as find new partners.

So that's why we ran a workshop!

Who came to the workshop?

The workshop attendees were a collection of scientists, techies, librarians, and data people. For this first workshop I did my best to reach out to people from a variety of communities - researchers from a variety of disciplines, librarians, trainers, data scientists, programmers, HPC admins, and research infrastructure specialists. In this, we somewhat succeeded! We didn't advertise very widely, partly because of a last-minute time crunch, and partly because too many people would have been a problem for the space we had.

As we figure out more of a framework and sales pitch for binder, I expect the set of possible attendees to expand. Still, for hackfest-like workshops, I'm a big fan of small diverse groups of people in a friendly environment.

What is the current state of binder?

The original mybinder.org Web site was created and supported by the Freeman Lab, but maintenance on the site suffered when Jeremy Freeman moved to the Chan-Zuckerberg Initiative and became even busier than before.

The Jupyter folk picked up the binder concept and reimplemented the Web site with somewhat enhanced functionality, building the new BinderHub software in Python around JupyterHub and splitting the repository-to-docker code out into repo2docker. This is now running on a day-to-day basis on a beta site.

A rough breakdown, and links to documentation, follow:

JupyterHub - JupyterHub manages multiple instances of the single-user Jupyter notebook server. JupyterHub can be used to serve notebooks to a class of students, a corporate data science group, or a scientific research group.

Zero-to-JupyterHub - Zero to JupyterHub with Kubernetes is a tutorial to help install and manage JupyterHub.

BinderHub - BinderHub builds "binders" containing data+code from GitHub repos and then serves the binders in a custom computing environment. beta.mybinder.org is a public BinderHub.

repo2docker - repo2docker builds, runs, and pushes Docker images from source code repositories.

Highlights of the binder workshop!

What did we do? We ran things as an unconference, and had a lot of discussions and brainstorming around use cases and the like, with some super cool results. The notes from those are linked below!

A few highlights of the meeting deserve, well, highlighting --

  • Amazingly, we got to the point where binder ran an RStudio Server instance, started from a Jupyter console!! Some tweets of this made the rounds, but it may take a few more weeks for this to make it into production. (This was based on Ryan Lovett's earlier work, which was then hacked on by Carl Boettiger, Yuvi Panda and Aaron Culich at the workshop. I have it on authority that Adelaide Rhodes was asking lots of questions by way of encouragement ;).

  • Everyone who attended the workshop got to the point where we had our own BinderHub instance on Google!! (We used these JupyterHub and BinderHub instructions). w00000t! (Session notes)

  • Yuvi Panda gave us a rundown on the data8 / "Foundations of Data Science" course at UC Berkeley, which uses JupyterHub to host several thousand users, with up to 700 concurrent sessions!

We came up with lots of use cases - see ~duplicate set of notes, here.

Other stuff we did at the workshop

(All the notes are on GitHub, here)

Here is a fairly comprehensive list of the other activities at the workshop --

Issues that we only barely touched on:

  • "I have a read only large dataset I want to provide access to for untrusted users, who can do whatever they want but in a safe way." What are good practices for this situation? How do we provide good access without downloading the whole thing?

  • It would be nice to initiate and control (?) Common Workflow Language workflows from binder - see nice Twitter conversation with Michael Crusoe.

  • How do we do continuous integration on notebooks??

  • We need some sort of introspection and badging framework for how reproducible a notebook is likely to be - what are best practices here? Is it "just" a matter of specifying software versions etc and bundling data, or ...??

Far reaching issues and questions --

  • It's likely that the future of binder involves many people running many different BinderHub instances. What kind of clever things can we do with federation? Would it be possible for people to run a binder backend "close" to their data and then allow other BinderHubs to connect to that, for example?

  • Many issues of publishing workflows, provenance, legality - notes

  • It would be super cool if realtime collaboration was supported by JupyterHub or BinderHub... it's coming, I hear. Soon, one hopes!

Topics we left almost completely untouched:

What's next?

I'm hoping to find money or time to run at least two more hackfests or conferences -- perhaps we can run one in Europe, too.

It would be good to run something with a focus on developing training materials (and/or exemplary notebooks) - see Use Cases, above.

I'm hoping to find support to do some demo integrations with scholarly infrastructure, as in the Imagine... section, above.

If (if) we ran a conference, I could see having some of the following sessions:

  • A hackfest building notebooks

  • A panel on deployment

  • Keynote on the roadmap for binder and JupyterHub

  • Some sort of community fest

If you're interested in any of this, please indicate your interest in future workshops!!

Where to get started with binder

There are lots of example repositories, here:

github.com/binder-examples

You can click "Launch Binder" in any of the READMEs to see examples!
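Behind a "Launch Binder" button is just a URL that tells a BinderHub which repository to build. Here is a minimal sketch of assembling such a URL by hand. The helper name `binder_launch_url` and the `BINDER_HOST` variable are made up for this example, and the `/v2/gh/<org>/<repo>/<ref>` layout and the `master` default are assumptions about how a public BinderHub is addressed, so check the badge in a real README before relying on them.

```shell
#!/bin/sh
# Hypothetical helper: build a launch URL for a GitHub repository.
# Assumptions (not stated in the post): the /v2/gh/<org>/<repo>/<ref>
# URL layout and "master" as the default ref. BINDER_HOST is an
# illustrative variable; it defaults to the public BinderHub named above.

binder_launch_url() {
    # $1 = GitHub org or user, $2 = repository name,
    # $3 = branch, tag, or commit (optional)
    printf 'https://%s/v2/gh/%s/%s/%s\n' \
        "${BINDER_HOST:-beta.mybinder.org}" "$1" "$2" "${3:-master}"
}
```

For example, `binder_launch_url binder-examples requirements` prints a URL for the `requirements` repository in the binder-examples organization; setting `BINDER_HOST` points the same helper at a different BinderHub deployment.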


There is a gitter chat channel that is pretty active and good for support: see gitter.im/jupyterhub/binder


And, finally, there is a Google Groups forum, binderhub-dev.

Some other links worth mentioning:

  • nbflow - one-button reproducible workflows with Jupyter Notebook and Scons (see video).
  • papermill - parameterize, execute, and analyze notebooks.
  • dataflow - a kernel to support Python dataflows in the Jupyter Notebook environment.

  • an example of how to use the new postBuild functionality to install jupyter notebook extensions.

aaaand some notes from singularity:

One way to convert docker images to singularity images, using docker2singularity

docker run -v /var/run/docker.sock:/var/run/docker.sock -v /tmp/image:/output \
    --privileged -t --rm singularityware/docker2singularity ubuntu:14.04  

Another way to simply run docker containers in singularity:

singularity exec docker://my/container <runcommand>

The End

I have no particular conclusion other than we'll have to do this again!

--titus
