DRAFT: A community-focused pre-publication data release and sharing policy for sequence data

This is a draft proposal of a policy to encourage pre-publication data release and data sharing within a community. This policy is based on discussions at the Cephalopod Genomics Workshop (a Catalysis workshop sponsored by NESCent).

Note that this draft is made available under a CC-BY-SA license, which permits use and re-use with attribution. Although it's still a draft, we hope you'll find it useful.

Authors, in alphabetical order:

  • C Titus Brown
  • Brian Dilkes
  • Eric Edsinger-Gonzales
  • Robert M. Freeman, Jr.
  • Erich M. Schwarz

with contributions by members of the Cephalopod community, including Wendy Crookes-Goodson and Clifton Ragsdale.


Genomic and transcriptomic data is most useful in aggregate, but the interest in fostering community use and serendipitous discovery must be balanced against due concern for publication, funding, and career recognition. We therefore propose a pre-publication data sharing policy and a minimal set of analyses that would drive "virtuous" data sharing behavior, encourage re-use, and provide the maximum benefit to the cephalopod community from people both inside and outside the current community. The proposal has the following components:


  1. A data policy that supports the rapid and broadest possible sharing of a subset of data, subject to significant restrictions on certain types of usage.
  2. A minimal set of automated analyses of submitted raw data, perhaps including genomic contig assembly (something quick and dirty, not necessarily including scaffolding across repeats, collapse of heterozygous regions, etc.) and RNA-seq assembly. These automated analyses could take advantage of all submitted data (e.g. co-assembling from multiple RNA-seq data types), but the results would be anonymized and 'flattened' so that the proprietary (original) data cannot be reconstructed, which protects the lab that created the data. Links to more polished data sets can be added as they arise.
  3. A minimal (but potentially very useful!) set of user-based analyses, including most especially BLAST, such that visitors (who have agreed to the data sharing policy) are able to search for specific genes by sequence similarity across the database.
  4. Precomputed analyses of the aggregate data set for "obvious" considerations such as homology across species, searches against nr or RefSeq, etc. Underlying protocols / pipelines and settings will be made explicit.
  5. Bulk availability and download of all submitted data sets in support of aggregate reanalyses and assembly, but only by explicit agreement with the Cephalopod Genomics steering committee. Some mechanism will be implemented to notify the source lab of the persons/lab downloading the information so that potential collaborations can be created.
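As a rough illustration of item 5, the sketch below shows one way the gating and notification step might work: a bulk download is refused unless the steering committee has agreed to it, and otherwise one notification is prepared per source lab. Every name here (Dataset, DownloadRequest, notifications_for, the contact_email field) is hypothetical and chosen for illustration; nothing in the proposal fixes how the mechanism would be built, and actually sending the messages is left out.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    name: str
    source_lab: str      # lab that generated the data
    contact_email: str   # hypothetical contact address for that lab


@dataclass
class DownloadRequest:
    requester: str
    requester_lab: str
    dataset_names: list
    approved_by_committee: bool = False


def notifications_for(request, catalog):
    """Return one (email, message) pair per source lab touched by the request.

    Raises PermissionError if the steering committee has not agreed to the
    bulk download.
    """
    if not request.approved_by_committee:
        raise PermissionError(
            "bulk download requires agreement with the steering committee"
        )

    # Group the requested data sets by the lab that submitted them.
    by_lab = {}
    for name in request.dataset_names:
        ds = catalog[name]
        by_lab.setdefault((ds.source_lab, ds.contact_email), []).append(ds.name)

    # Build one message per lab, in a stable (sorted) order.
    messages = []
    for (lab, email), names in sorted(by_lab.items()):
        body = (
            f"{request.requester} ({request.requester_lab}) downloaded "
            f"{', '.join(sorted(names))} submitted by the {lab} lab."
        )
        messages.append((email, body))
    return messages
```

Grouping by lab means a source lab receives a single message listing all of its data sets in the download, which seems closer to the goal of prompting collaborations than one message per file.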


(...include example cases that are in/appropriate uses...)

We note that the Metazome site, for example, provides much of this functionality. GMOD and Maker might serve as a basic technical foundation for an independent Web site.


Specific language for the data sharing policy agreement:

For example, the JGI data release policy specifies immediate release of raw read data, assemblies, and automated annotations, with a number of reserved analyses. More specifically,

Reserved analyses include the identification of complete (whole genome) sets of genomic features such as genes, gene families, regulatory elements, repeat structures, GC content, phylogeny, etc., and whole-genome comparisons of regions of evolutionary conservation or change. (Studies of any type on the reserved data sets that are not in direct competition with those planned by the JGI and its collaborators may also be undertaken following an agreement to that effect. Interested parties are encouraged to contact the principal collaborator and JGI to discuss such possibilities.)

Note that data generated with US funding is subject to (at a minimum) the Fort Lauderdale agreement, which "reaffirms a balance between fair use (i.e. no preemptive publication) and early disclosure."

Whichever data policy is ultimately adopted, we believe that it should be implemented as a "click through" license with username/IP address recorded, such that anyone from any community may agree to it and gain access to the data. Any data (and any derived analyses) posted on the site would then be automatically subject to this minimum data policy.
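A minimal sketch of the "click through" idea, assuming an in-memory log: each agreement records the username and IP address (plus a timestamp), and data access is refused until an agreement exists. The names AgreementLog, fetch_data, and POLICY_VERSION are hypothetical, and tracking a policy version is an added assumption, included only so that a revised policy could require a fresh click-through.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

POLICY_VERSION = "draft-1"  # hypothetical label for the current policy text


@dataclass(frozen=True)
class Agreement:
    username: str
    ip_address: str
    policy_version: str
    agreed_at: datetime


class AgreementLog:
    """In-memory record of who clicked through which policy version."""

    def __init__(self):
        self._records = []

    def record(self, username, ip_address, policy_version=POLICY_VERSION):
        # Store username and IP address, as the policy text suggests.
        entry = Agreement(username, ip_address, policy_version,
                          datetime.now(timezone.utc))
        self._records.append(entry)
        return entry

    def has_agreed(self, username, policy_version=POLICY_VERSION):
        return any(
            r.username == username and r.policy_version == policy_version
            for r in self._records
        )


def fetch_data(log, username, dataset_name):
    """Grant access only to users who have agreed to the current policy."""
    if not log.has_agreed(username):
        raise PermissionError("please agree to the data sharing policy first")
    return f"access granted to {dataset_name}"
```

A real site would persist these records in a database rather than a list, but the check itself stays the same: no recorded agreement, no data.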

Some suggested phrasing:

"Our policy is that early release should aid the progress of science."


Legacy Comments

Posted by Titus Brown on 2012-05-27 at 11:35.

See https://docs.google.com/document/d/1EZdmveeFEyaiHKD_OuHNIOuRcsQNO0p6y3gTyaDfR-U/edit for a better writeup for a white paper.
