Mon 17 February 2014
C. Titus Brown
open data f-yeah
Note: updated 2/18 with Brenton Graveley's name -- he did the squid
A few months back, I announced
the khmer protocols project, an
effort to write down an explicit, open protocol for transcriptome and
metagenome assembly. This project was started during the summer of
2013 at the Woods Hole Marine Biological Lab, in
collaboration with Joshua Rosenthal.
I'd met Josh at the CephSeq meeting the
year before, and we reconnected at MBL in 2013; there, motivated in
part by discussions at CephSeq, I started writing the eel-pond
During the summer, Josh and I found that we were both incredibly
frustrated by the general failure of researchers to share data pre-pub, and so
we started to chat seriously about some ideas we'd independently had
about incentivizing the opening of transcriptome data. This
eventually led to
the Open Marine Transcriptome project,
which is still in its formative stages.
However, in the meantime I am happy to report that Josh assembled a
bunch of squid transcriptomes and is willing to make them available to
all. In addition, Josh did some skim Illumina sequencing (~40x
coverage) of the squid genome, and my lab assembled it; we are making
a really basic draft genome available as well. (Here, by squid, we
mean Loligo pealeii, or the
Longfin inshore squid -- "loligo"
for short. It is also sometimes called "Doryteuthis pealeii".)
It should be noted that one of the transcriptomes was made from
carefully dissected Giant Fiber Lobe neurons. These are the cell
bodies for the well-studied squid giant axon. Thus this transcriptome
should provide "molecular support" for all the cell biology and
neurophysiology that has focused on this most famous of preps.
The skim genome sequencing was done by Josh together with Brenton
Graveley of the University of Connecticut Health Center.
We are pleased to announce the availability of this data, in two formats.
First, you can download the assembled and annotated transcriptome data
in FASTA format:
as well as the (unannotated) genome assembly:
Loligo pealeii genomic
Second, we have
also made available a public BLAST server where you can search the individual transcriptomes, a database of the concatenated transcriptomes, and the genomic sequence. (The BLAST server is not on a particularly powerful machine so I may turn it off if thousands of squid enthusiasts start pounding on it ;)
The various tissues were dissected by Josh Rosenthal and sequenced
using an Illumina HiSeq. Raw data and details will be made available as
part of a data paper. Assembly and annotation were done per
the khmer protocols v0.8.4, with the exception of using the UniProt database for annotation instead
The genomic data was similarly sequenced with a HiSeq by Brenton
Graveley and was then subjected to a variant of the
metagenome protocol: specifically,
we did three-pass digital normalization and then assembled with
Velvet. This will be described in more detail in the data paper.
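For readers who haven't met digital normalization before, here is a minimal sketch of a single pass: a read is kept only if the median count of its k-mers, among the reads kept so far, is still below a cutoff. This is a toy version with exact dictionary counting, not khmer's code (khmer uses a memory-efficient probabilistic counter), and the `k` and `cutoff` defaults are illustrative assumptions rather than the settings used for the squid data. How the three passes were configured is left to the data paper.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return every overlapping k-mer in seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def normalize_pass(reads, k=20, cutoff=20, counts=None):
    """One pass of digital normalization (toy version).

    Keeps a read only if the median count of its k-mers, counted over
    the reads kept so far, is below `cutoff`. The k and cutoff defaults
    are illustrative assumptions, not values taken from the post.
    """
    if counts is None:
        counts = Counter()
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read is shorter than k; nothing to count
        if median(counts[km] for km in read_kmers) < cutoff:
            counts.update(read_kmers)  # only kept reads add to the counts
            kept.append(read)
    return kept
```

Feeding the same read in many times shows the effect: the first few copies are kept, and later copies are dropped once the median k-mer count reaches the cutoff, while a read from an unseen region still gets through.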
Why are you making these data available?
We love cephalopods (see
this great writeup of why), and we think that making this data available will
accelerate cephalopod research. You can read more about our general
How should I use them?
Heck if we know. Impress us!
How should I cite them?
We will post a citation handle to
shortly. If you desperately need to cite us, please contact us a few
days in advance of when you need to cite ;).
What if the transcripts or contigs are wrong?
Every assembly is at least a little bit wrong, but we're reasonably
confident that our assembly approaches work OK. We've also tried to
maximize sensitivity of the genome assembly, at the expense of contig
length; we believe this means that misassemblies are also somewhat
That having been said, it is your responsibility to validate your
own analyses. Caveat emptor.
You may, of course, also pretend that the data is not available and then
not use it for anything.
What data sharing license are these data under?
We're releasing these data under
Creative Commons 0, the most
liberal of the Creative Commons licenses. This is a public domain
dedication: do with the data what you will. If you publish something
based on this data, it is standard academic practice to cite us (see
Hey, can I repost these on my own site?
Sure thing. Just remember that standard academic practice is to cite
the origin of the data, so even if you do something super cool in
terms of remixing the data with other data sets, we'd appreciate a
link and guidance for users of your data set to cite us.
Why aren't you trying to get a Science or Nature paper out of all this?
Aren't your reputations going to suffer for doing all this work without
trying to milk the data for all it's worth?
We're pretty sure our reputations won't suffer from making a bunch of
useful data available. Heck, we're pretty sure yours
wouldn't suffer from making a bunch of useful data available (hint).
The genome is, like, completely useless!? The N50 is about 250 bases!
Yep. Sorry! If you give us $200k we will give you a better genome in
~6 months (annotations not included). In the meantime, we think that
this genome is extremely useful for determining exon structure and
avoiding degenerate PCR at all costs. In fact, this genome shows
near-complete coverage of the ORFs from a handful of cDNAs cloned
I'm serious about the $200k. And that's direct, not direct+indirect.
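If you want to check the N50 complaint yourself, here is a small sketch that computes it from a FASTA file. N50 is the length of the contig at which, going from longest to shortest, the running total first reaches half of the total assembly length. The function names are made up for illustration, and the parser assumes plain, uncompressed FASTA text.

```python
def read_fasta_lengths(lines):
    """Return the sequence length of each record in FASTA-formatted lines."""
    lengths = []
    current = None  # length of the record being read, None before the first header
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs given")
```

With a downloaded assembly saved under a hypothetical name such as `contigs.fa`, usage would look like `n50(read_fasta_lengths(open("contigs.fa")))`.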
Do you want to collaborate? I have a bunch of analyses that you can do
for me -- Science/Nature paper guaranteed!
Nope, sorry, too busy. I am, however, happy to
not collaborate --
see the Open Marine Transcriptome project.
Send us your tired, your poor, your huddled transcriptomes...
We'd love your thoughts on these questions of ours in the comments:
Is this a good way for us to post the data? Is there a better way?
How important is raw data (raw reads) to you? Should we accelerate
the posting of the raw data? (You can have it now for all we care, but
we don't have a good place to post really big files.)
We'd also love any other feedback, but I reserve the right to publicly
post negative feedback that is particularly entertaining.