Mon 27 April 2015
C. Titus Brown
cos osf galaxy
Note - this was an internal funding request solicited by the
Center for Open Science. It's been funded!
Brief: We propose to integrate OSF into Galaxy as a data store. For
this purpose, we request 3 person-months of funding (one developer at
half-time for 6 months), plus travel.
Introduction and summary: Galaxy is a commonly used open source
biomedical/biological sequence data analysis platform that enables
biologists to put together reproducible pipelines and execute analyses
locally or in the cloud. Galaxy has a robust and sophisticated
Web-based user interface for setting up these pipelines and analyzing
data. One particular challenge for Galaxy is that on cloud instances,
data storage and publication must be done using local filesystems and
remote URLs, which adds a significant amount of complexity for
biologists interested in doing reproducible computing. Recently,
Galaxy gained a data abstraction layer that permits object stores to
be used instead of local filesystems. The Center for Open Science’s
Open Science Framework (OSF), in turn, is a robust platform for
storing, manipulating, and sharing scientific data, and provides APIs
for accessing such data; the OSF can also act as a broker for
accessing and managing remote data stores, e.g. on cloud
providers. Integrating the OSF’s object store into Galaxy would let
Galaxy use OSF for data persistence and reproducibility, and would let
Galaxy users take advantage of OSF’s data management interface, APIs,
and authentication to expand their reproducible biomedical science
workflows. This integration would also rigorously exercise newly
developed functionality in both Galaxy and the OSF, providing
valuable use cases and real-world testing.
Our “stretch” goal would be to expand beyond Galaxy and work with
Project Jupyter/IPython Notebook’s data abstraction layer to provide
an OSF integration for Project Jupyter.
We note with enthusiasm that all groups mentioned here are robust
participants in the open source/open science ecosystem, and all
projects are fully open source, with contributor guidelines and
Broader impacts: If successful, the proposed project would address several
broader issues. First, the OSF would have an external consumer of its
APIs for data access, which would drive the maturation of these APIs
with use cases. Second, the OSF would expand to support connections
with a visible project in a non-psychology domain, giving COS a
proof-of-concept demonstration for expansion into new communities.
Third, the Galaxy biomedical community would gain connections to the
OSF’s functionality, which would help in execution, storage, and
publication of biomedical data analyses. Fourth, the Brown Lab would
then be able to explore further work to build their Moore-DDD-funded
data analysis portal on top of both Galaxy and the OSF, leveraging the
functionality of both projects to advance open science and
reproducibility. Even a partial failure would be informative by
exposing faults in the OSF or Galaxy public APIs and execution models,
which could then be addressed by the projects individually. This
project would also serve as a “beta test” of the COS as an incubator
of open science software projects.
Longer-term outcomes: the Brown Lab and the COS are both interested in
exploring the OSF as a larger hub for data storage for workflow
execution, teaching and training in data-intensive science, and
hosting reproducible publications. This proposed project is a
first step in those directions.