This winter quarter I taught my usual graduate-level introductory
bioinformatics lab at UC Davis, GGG 201(b), for the fourth time. The
course lectures are given by Megan Dennis and Fereydoun Hormozdiari,
and I do a largely separate lab that aims to teach the basics of
practical variant calling, de novo assembly, and RNAseq differential
expression.
I also co-developed and co-taught a new course, GGG 298 / Tools for
Data Intensive Research, with Shannon Joslin, a graduate student here
in Genetics & Genomics who (among other things) took GGG 201(b) the
first time I offered it. GGG 298 is a series of ten half-day workshops
where we teach shell, conda, snakemake, git, RMarkdown, etc. - you can
see the syllabus for GGG 298 here.
This time around, I did a complete redesign of the
GGG 201(b) lab (see syllabus)
to focus on using
snakemake workflows.
I'm 80% happy with how it went - there's some overall fine-tuning to
be done, and snakemake has some corners that need more explaining than
others, but I think the basic concepts got through to a lot of
the students. I also think I'm finally teaching people something they
really need to know, which is how to build, automate, place controls
on, and execute complex bioinformatics workflows.
I was traveling the week before last, so I asked Taylor Reiter and
Tessa Pierce to do the first RNAseq lecture for the class (week 8!). As
part of their
brilliant RNAseq materials
for the class (snakemake! salmon! tximeta! DESeq2! RMarkdown!), Tessa
used a cute trick in the Snakefile that I hadn't seen before. It's
"obvious" if you're a Python+snakemake expert, but many people aren't,
and in any case it's always nice to share, right??
Below, I take the opportunity to share several solutions for loading
sample names into the Snakefile.
(These are fairly boilerplate examples that you can use in your own
code with little modification, too!)
Cute snakemake trick #1: dictionaries for downloads
The following code snippet is a nice, simple Pythonic way to download
a bunch of files from Web URLs.
# list sample names & download URLs
sample_links = {"ERR458493": "https://osf.io/5daup/download",
                "ERR458494": "https://osf.io/8rvh5/download",
                "ERR458495": "https://osf.io/2wvn3/download",
                "ERR458500": "https://osf.io/xju4a/download",
                "ERR458501": "https://osf.io/nmqe6/download",
                "ERR458502": "https://osf.io/qfsze/download"}

# the sample names are dictionary keys in sample_links; extract them to a list we can use below
SAMPLES = list(sample_links.keys())

# download yeast rna-seq data from Schurch et al, 2016 study
rule download_all:
    input:
        expand("rnaseq/raw_data/{sample}.fq.gz", sample=SAMPLES)

# rule to download each individual file specified in sample_links
rule download_reads:
    output: "rnaseq/raw_data/{sample}.fq.gz"
    params:
        # dynamically generate the download link directly from the dictionary
        download_link = lambda wildcards: sample_links[wildcards.sample]
    shell: """
        curl -L {params.download_link} -o {output}
        """
Cute snakemake trick #2: loading filenames from the current directory.
(I don't recommend this approach. Read on.)
One of the most common questions I've been asked in the last few weeks
is how to avoid typing all of the sample names into the
Snakefile. (This can matter a lot when you have hundreds of samples!)
After you download the files above, you can get a list of the
downloaded files like so:
# the 'sample' attribute of the glob_wildcards result holds the matches
sample_ids = glob_wildcards("rnaseq/raw_data/{sample}.fq.gz").sample
Now, sample_ids is a Python list that behaves just like SAMPLES, and
it can be used with expand.

Note, for this example, SAMPLES and sample_ids are going to contain
the same list of sample names. The difference is that sample_ids is
loaded from the directory listing, while SAMPLES has to be written out
in the Snakefile somehow (here, in sample_links).
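For example, here's a minimal sketch of feeding the glob_wildcards
result into expand - the trimmed-output filename pattern is made up
purely for illustration:

# collect the sample names from the raw data directory
sample_ids = glob_wildcards("rnaseq/raw_data/{sample}.fq.gz").sample

# hypothetical target rule: one trimmed file per raw file found above
rule all:
    input:
        expand("rnaseq/trimmed/{sample}.qc.fq.gz", sample=sample_ids)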
Why don't I recommend this approach? You can only use this
approach if the files already exist in the directory. That's fine -
often you don't want to copy them in or download them dynamically! -
but it sets up a particular kind of potential error. If you load the
list of samples from your working directory, and you accidentally
delete one of the sample files, you'll omit it from your workflow
without knowing.
It's much better to independently specify the list of files, so that
if you accidentally delete one, snakemake will complain. That's where
the next trick comes in.
As a bonus, the next approach lets you keep sample metadata in the
same spreadsheet, which is important!
Cute snakemake trick #3: loading a list of sample names from a spreadsheet.
This is taken from a really nice, clean
example RNAseq workflow that uses STAR and DESeq2,
written by Johannes Köster, Sebastian Schmeier, and Jose Maturana.
Here,
the Snakefile
loads sample names from a tab-separated values spreadsheet using
pandas; a simplified version of the code follows:
import pandas as pd
samples_df = pd.read_table('samples.tsv').set_index("sample", drop=False)
sample_names = list(samples_df['sample'])
Here, sample_names is the same as SAMPLES and sample_ids, above - a
list that you can use in expand and so on. The difference here is that
samples_df is a Pandas dataframe that contains other information, such
as sample metadata; and it's loaded from a TSV file that can be
created, visualized, and edited using spreadsheet software.
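For illustration, a minimal samples.tsv for the yeast data above might
look like the following (columns separated by tabs; the condition
column is a made-up example of the kind of metadata you could store):

sample	condition
ERR458493	wt
ERR458494	wt
ERR458495	wt
ERR458500	snf2
ERR458501	snf2
ERR458502	snf2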
Cute snakemake trick #4: loading a list of download links from a spreadsheet.
The TSV approach is particularly useful for downloading files or
moving files, as the download links or file paths can be included in
the spreadsheet, rather than at the top of the Snakefile (as they were
in cute trick #1).
Considering the same yeast RNAseq data as the first example and a TSV
file containing the sample names and download links, samples can be
downloaded like so:
import pandas as pd

samples_df = pd.read_table('samples.tsv').set_index("sample", drop=False)
SAMPLES = list(samples_df['sample'])

# download yeast rna-seq data from Schurch et al, 2016 study
rule download_all:
    input:
        expand("rnaseq/raw_data/{sample}.fq.gz", sample=SAMPLES)

# rule to download each individual file specified in samples_df
rule download_reads:
    output: "rnaseq/raw_data/{sample}.fq.gz"
    params:
        # dynamically grab the download link from the "dl_link" column in the samples data frame
        download_link = lambda wildcards: samples_df.loc[wildcards.sample, "dl_link"]
    shell: """
        curl -L {params.download_link} -o {output}
        """
Enjoy! And comments are welcome!
--titus (and Tessa and Taylor!)