Living in an Ivory Basement Stochastic thoughts on science, testing, and programming.

You can make GitHub repositories archival by using Zenodo or Figshare!

Update: Zenodo will remove content upon request by the owner, and hence is not suitable for long-term archiving of published code and data. Please see my comment at the bottom (which is just a quote from an e-mail from a journal editor), and especially see "Ownership" and "Withdrawal" under Zenodo policies. I agree with the journal's interpretation of these policies.

Bioinformatics researchers are increasingly pointing reviewers and readers at their GitHub repositories in the Methods sections of their papers. Great! Making the scripts and source code for methods available via a public version control system is a vast improvement over the methods of yore ("e-mail me for the scripts" or "here's a tarball that will go away in 6 months").

A common point of concern, however, is that GitHub repositories are not archival. That is, you can modify, rewrite, delete, or otherwise irreversibly mess with the contents of a git repository. And, of course, GitHub could go the way of Sourceforge and Google Code at any point.

So GitHub is not a solution to the problem of making scripts and software available as part of the permanent record of a publication.

But! Never fear! The folk at Zenodo and Mozilla Science Lab (in collaboration with Figshare) have solutions for you!

I'll tell you about the Zenodo solution, because that's the one we use, but the Figshare approach should work as well.

How Zenodo works

Briely, at Zenodo you can set up a connection between Zenodo and GitHub where Zenodo watches your repository and produces a tarball and a DOI every time you cut a release.

For example, see https://zenodo.org/record/31258, which archives https://github.com/dib-lab/khmer/releases/tag/v2.0 and has the DOI http://doi.org/10.5281/zenodo.31258.

When we release khmer 2.1 (soon!), Zenodo will automatically detect the release, pull down the tar file of the repo at that version, and produce a new DOI.

The DOI and tarball will then be independent of GitHub and I cannot edit, modify or delete the contents of the Zenodo-produced archive from that point forward.

Yes, automatically. All of this will be done automatically. We just have to make a release.

Yes, the DOI is permanent and Zenodo is archival!

Zenodo is an open-access archive that is recommended by Peter Suber (as is Figshare).

While I cannot quickly find a good high level summary of how DOIs and archiving and LOCKSS/CLOCKSS all work together, here is what I understand to be the case:

Digital object identifiers are permanent and persistent. (See Wikipedia on DOIs)
Zenodo policies say:

"Retention period

Items will be retained for the lifetime of the repository. This is currently the lifetime of the host laboratory CERN, which currently has an experimental programme defined for the next 20 years at least."

So I think this is at least as good as any other archival solution I've found.

Why is this better than journal-specific archives and supplemental data?

Some journals request or require that you upload code and data to their own internal archive. This is often done in painful formats like PDF or XLSX, which may guarantee that a human can look at the files but does little to encourage reuse.

At least for source code and smallish data sets, having the code and data available in a version controlled repository is far superior. This is (hopefully :) the place where the code and data is actually being used by the original researchers, so having it kept in that format can only lower barriers to reuse.

And, just as importantly, getting a DOI for code and data means that people can be more granular in their citation and reference sections - they can cite the specific software they're referencing, they can point at specific versions, and they can indicate exactly which data set they're working with. This prevents readers from going down the citation network rabbit hole where they have to read the cited paper in order to figure out what data set or code is being reused and how it differs from the remixed version.

Bonus: Why is the combination of GitHub/Zenodo/DOI better than an institutional repository?

I've had a few discussions with librarians who seem inclined to point researchers at their own institutional repositories for archiving code and data. Personally, I think having GitHub and Zenodo do all of this automatically for me is the perfect solution:

quick and easy to configure (it takes about 3 minutes);
polished and easy user interface;
integrated with our daily workflow (GitHub);
completely automatic;
independent of whatever institution happens to be employing me today;

so I see no reason to switch to using anything else unless it solves even more problems for me :). I'd love to hear contrasting viewpoints, though!

thanks!

--titus

There are comments.

social