As part of our Docker hands-on workshop
earlier this month, I learned a lot about building Dockerfiles,
running Docker containers on remote hosts with docker-machine, and
using data volumes to manage data in remotely hosted Docker
containers.
During and after the workshop, I put together Docker images (and, more
importantly, build instructions *for* those images) for a few different
pieces of transcriptomics software: khmer, for digital normalization; salmon, for transcript quantification;
transrate, for transcriptome
quality evaluation; and dammit,
a pipeline for transcriptome annotation.
Other than remedying my basic ignorance of (first) docker-machine and
(second) data volumes, the only somewhat tricky bit was dammit's
databases. dammit relies on a fairly large collection of databases,
and these databases need to be established locally (downloaded and
processed) in order to run dammit. While time-consuming, this only
needs to be done once. So I had to fiddle around a bit, and ended up
with an image that downloads the data if it's not already present
(diblab/dammit-db-helper).
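The basic idea is just an idempotency guard around dammit's database
installer. Here's a minimal sketch of the pattern, assuming dammit's
databases --install subcommand and a made-up marker file (the actual
diblab/dammit-db-helper entrypoint may be organized differently):
#!/bin/bash
# hypothetical entrypoint: install the dammit databases into /dammit-db,
# but only on the first run (the marker file and the --database-dir flag
# are assumptions for illustration)
if [ ! -e /dammit-db/.installed ]; then
    dammit databases --install --database-dir /dammit-db && \
        touch /dammit-db/.installed
fi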
It's important to note that these data volume issues are things you
run into when you can't (or don't want to) mount local volumes
because you're using docker-machine to run your Docker containers on a
remote host. If you're using Docker locally, you can just put
everything on local disk and mount it into the running Docker
containers. But in this case I explicitly want to make use of
resources greater than are available on my laptop by using
docker-machine.
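For the record, the two setups look like this (the driver, machine
name, and paths here are illustrative):
# local Docker: just bind-mount a host directory into the container
docker run --rm -v $HOME/nema:/nema -it ubuntu:15.10 ls /nema
# remote Docker: create a host with docker-machine and point your
# docker client at it
docker-machine create --driver virtualbox workshop
eval "$(docker-machine env workshop)"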
Below are some demo instructions for running transrate and dammit to
evaluate and annotate a transcriptome, using Docker containers.
Everything below should work on both local and remote Docker hosts
(i.e. docker default install or docker-machine), assuming the docker
host has about 15 GB of disk space available for the dammit databases.
The time-consuming bits are (a) downloading the Docker images, and (b)
downloading & installing the dammit databases.
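If you want to get (a) out of the way up front, you can pull the
images explicitly before doing anything else:
docker pull diblab/transrate
docker pull diblab/dammit
docker pull diblab/dammit-db-helper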
Personally, I found installing transrate and dammit (and salmon) to be
a big PITA, so the fact that I may never have to do that again is a big
win :).
Comments welcome -- I'd love to find easier/better ways of doing this!
--titus
Preparing the data
First, create a data volume containing your transcriptome, and name it
nema_vol (after Nematostella vectensis, the organism whose
transcriptome we're using):
docker create -v /nema --name nema_vol ubuntu:15.10 /bin/true
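If you want to double-check that the container and its volume exist,
docker inspect will show the mount (the .Mounts field should be there
on Docker 1.8 and later):
docker inspect -f '{{ .Mounts }}' nema_vol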
Next, extract the transcriptome (nema.fa) from the remote tarball:
curl -L https://s3.amazonaws.com/public.ged.msu.edu/nema-subset.tar.gz | tar xzf - nema.fa
Gzip and then copy the transcriptome to /nema/nema.fa.gz on the
nema_vol container:
gzip nema.fa && docker cp nema.fa.gz nema_vol:/nema
This makes it available to other containers via the nema_vol container,
which can mount it at /nema using the --volumes-from option.
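For example, any throwaway container can now list it:
docker run --rm --volumes-from nema_vol -it ubuntu:15.10 ls -l /nema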
Next, uncompress the nema.fa.gz file:
docker run --rm --volumes-from nema_vol -it ubuntu:15.10 \
gunzip -f /nema/nema.fa.gz
Now the data is ready: it's available as /nema/nema.fa in any container
run with --volumes-from nema_vol.
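A quick sanity check, if you're so inclined:
# peek at the first FASTA header to make sure the file is intact
docker run --rm --volumes-from nema_vol -it ubuntu:15.10 \
head -1 /nema/nema.fa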
Running transrate
To run transrate in its most basic mode, generating assembly statistics,
you can execute:
docker run --rm --volumes-from nema_vol -it diblab/transrate \
transrate --assembly /nema/nema.fa --output=/nema/nema.fa.transrate
This will output:
[ INFO] 2015-11-20 16:25:07 : Loading assembly: /nema/nema.fa
[ INFO] 2015-11-20 16:25:49 : Analysing assembly: /nema/nema.fa
[ INFO] 2015-11-20 16:25:49 : Results will be saved in /nema/nema.fa.transrate
[ INFO] 2015-11-20 16:25:49 : Calculating contig metrics...
[ INFO] 2015-11-20 16:26:25 : Contig metrics:
[ INFO] 2015-11-20 16:26:25 : -----------------------------------
[ INFO] 2015-11-20 16:26:25 : n seqs 198151
[ INFO] 2015-11-20 16:26:25 : smallest 201
[ INFO] 2015-11-20 16:26:25 : largest 17655
[ INFO] 2015-11-20 16:26:25 : n bases 137744672
[ INFO] 2015-11-20 16:26:25 : mean len 695.15
[ INFO] 2015-11-20 16:26:25 : n under 200 0
[ INFO] 2015-11-20 16:26:25 : n over 1k 37271
[ INFO] 2015-11-20 16:26:25 : n over 10k 64
[ INFO] 2015-11-20 16:26:25 : n with orf 46134
[ INFO] 2015-11-20 16:26:25 : mean orf percent 63.77
[ INFO] 2015-11-20 16:26:25 : n90 252
[ INFO] 2015-11-20 16:26:25 : n70 573
[ INFO] 2015-11-20 16:26:25 : n50 1315
[ INFO] 2015-11-20 16:26:25 : n30 2271
[ INFO] 2015-11-20 16:26:25 : n10 4111
[ INFO] 2015-11-20 16:26:25 : gc 0.44
[ INFO] 2015-11-20 16:26:25 : gc skew 0.01
[ INFO] 2015-11-20 16:26:25 : at skew 0.0
[ INFO] 2015-11-20 16:26:25 : cpg ratio 1.73
[ INFO] 2015-11-20 16:26:25 : bases n 0
[ INFO] 2015-11-20 16:26:25 : proportion n 0.0
[ INFO] 2015-11-20 16:26:25 : linguistic complexity 0.13
[ INFO] 2015-11-20 16:26:25 : Contig metrics done in 36 seconds
[ INFO] 2015-11-20 16:26:25 : No reads provided, skipping read diagnostics
[ INFO] 2015-11-20 16:26:25 : No reference provided, skipping comparative diagnostics
[ INFO] 2015-11-20 16:26:25 : Writing contig metrics for each contig to /nema/nema.fa.transrate/nema/contigs.csv
[ INFO] 2015-11-20 16:26:55 : Writing analysis results to assemblies.csv
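If you have the original reads, transrate can also do the read-based
diagnostics it skipped above; something like the following should work,
where left.fq.gz and right.fq.gz are placeholder names for read files
you've copied into the volume with docker cp:
docker run --rm --volumes-from nema_vol -it diblab/transrate \
transrate --assembly /nema/nema.fa \
--left /nema/left.fq.gz --right /nema/right.fq.gz \
--output=/nema/nema.fa.transrate2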
Running dammit
For dammit annotation, let's extract only a few sequences so it doesn't
take too long!
docker run --rm --volumes-from nema_vol -it ubuntu:15.10 \
sh -c 'head -110 /nema/nema.fa > /nema/short.fa'
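You can check how many sequences that grabbed:
# count the FASTA records in the subset
docker run --rm --volumes-from nema_vol -it ubuntu:15.10 \
grep -c '^>' /nema/short.fa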
Now prepare the dammit databases; this can be run multiple times but should
complete very quickly after the first run:
# create a dammit-db data volume; will fail (safely) if run multiple times.
docker create -v /dammit-db --name dammit-db ubuntu:15.10 /bin/true
# download & prepare the databases; can be run multiple times.
docker run --rm --volumes-from dammit-db -it diblab/dammit-db-helper
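If you're wondering whether the databases fit in the space you have,
you can check their size:
# report the disk usage of the installed databases
docker run --rm --volumes-from dammit-db -it ubuntu:15.10 du -sh /dammit-db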
Finally, run dammit, loading the databases from dammit-db and the
transcriptome data from nema_vol, and putting the output annotation
in /nema/short.fa.dammit:
docker run --volumes-from dammit-db --volumes-from nema_vol \
-it diblab/dammit \
dammit annotate /nema/short.fa -o /nema/short.fa.dammit
This yields the following runtime output:
--- Running annotate!
Transcriptome file: /nema/short.fa
Output directory: /nema/short.fa.dammit
[x] sanitize_fasta:short.fa
[x] transcriptome_stats:short.fa
[x] busco:short.fa-metazoa
[x] TransDecoder.LongOrfs:short.fa
[x] hmmscan:longest_orfs.pep.x.Pfam-A.hmm
[x] TransDecoder.Predict:short.fa
[x] cmscan:short.fa.x.Rfam.cm
[x] lastal:short.fa.x.orthodb.maf
[x] maf_best_hits:short.fa.x.orthodb.maf-short.fa.x.orthodb.maf.best.csv
[x] maf-gff3:short.fa.x.orthodb.maf.gff3
[x] hmmscan-gff3:short.fa.pfam-A.tbl.gff3
[x] cmscan-gff3:short.fa.rfam.tbl.gff3
[x] gff3-merge:short.fa.dammit.gff3
After all of this, you can grab the transrate and dammit results like so:
# copy off the transrate output:
docker cp nema_vol:/nema/nema.fa.transrate/assemblies.csv .
docker cp nema_vol:/nema/nema.fa.transrate/nema/contigs.csv .
# copy off the final GFF3 transcriptome annotation from dammit:
docker cp nema_vol:/nema/short.fa.dammit/short.fa.dammit.gff3 .
The nema_vol data volume can be removed with:
docker rm -v nema_vol
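Likewise, if you want to reclaim the ~15 GB of database space, the
dammit-db volume can be removed the same way:
docker rm -v dammit-db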