Here are talk notes and links for my PyCon 2015 talk.
The talk slides are up on SlideShare.
General background
You should definitely check out Mike Lin's great blog posts on "Blogging my genome".
I found SNPedia through this wonderful blog post on how to use 23andMe irresponsibly, on Slate Star Codex.
My introduction to bcbio came from Brad Chapman's excellent blog post on evaluating and comparing variant detection methods.
There are several openly available benchmarking data sets for human genetics/genomics. The Ashkenazim data set I used for my talk is here, and you can see the Personal Genome Project profile for the son, here. The raw data is available here, and you can see the resequencing report for the son, here.
The Personal Genome Project is worth checking out.
More and more of human genetics and genomics is "open" -- check out the Variant Call Format (VCF) spec, now on github.
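To give a feel for what the VCF spec describes, here is a minimal sketch of reading the tab-separated variant lines in plain Python. The function names (`parse_vcf`, `parse_info`), the sample names, and the example variant are all made up for illustration; they are not from the talk data. The sketch also skips most of what the spec covers, so for real work a maintained VCF library is a better choice.

```python
def parse_info(info):
    """Turn an INFO column such as 'DP=30;DB' into a dict."""
    if info == ".":
        return {}
    parsed = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        # Entries without '=' are flags; record them as True.
        parsed[key] = value if sep else True
    return parsed


def parse_vcf(lines):
    """Yield one dict per variant line of a VCF file."""
    samples = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # blank lines and meta-information lines
        if line.startswith("#"):
            # Header line: sample names start at the tenth column.
            samples = line.split("\t")[9:]
            continue
        fields = line.split("\t")
        chrom, pos, var_id, ref, alt, qual, filt, info = fields[:8]
        record = {
            "chrom": chrom,
            "pos": int(pos),
            "id": var_id,
            "ref": ref,
            "alt": alt.split(","),  # ALT may list several alleles
            "qual": qual,
            "filter": filt,
            "info": parse_info(info),
            "genotypes": {},
        }
        if len(fields) > 9:
            # FORMAT names the per-sample keys, e.g. 'GT:DP'.
            keys = fields[8].split(":")
            for name, column in zip(samples, fields[9:]):
                values = dict(zip(keys, column.split(":")))
                record["genotypes"][name] = values.get("GT", ".")
        yield record


# A made-up three-sample example, not real data.
EXAMPLE = "\n".join([
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tson\tfather\tmother",
    "1\t12345\t.\tA\tG\t50\tPASS\tDP=30;DB\tGT:DP\t0/1:10\t0/0:9\t0/1:11",
])

if __name__ == "__main__":
    for record in parse_vcf(EXAMPLE.splitlines()):
        print(record["chrom"], record["pos"], record["genotypes"])
```

Each record keeps only the genotype (GT) call per sample; the other per-sample fields are dropped to keep the sketch short.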
Follow-on links
If you're interested in keeping up with human genomics, Twitter is a pretty good place to go. I asked who to follow and got a great list -- go here.
I asked on Twitter about reference papers for human genomic diversity and got a bunch of great references; all worth skimming.
Pipeline
To run the bcbio variant calling pipeline I discuss in the talk, or to examine the SNPs in the Ashkenazim trio with Gemini, take a look at my pipeline notes. The Gemini part starts from the VCF files and lets you examine SNPs for the three individuals in the trio.
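As a rough illustration of the kind of question you can ask of trio data, here is a small Python sketch that checks whether a child's genotype at one site could have been inherited from the two parents. This is not how Gemini does it, and it is not Gemini's API; the function names are hypothetical. It assumes diploid genotype strings in the VCF style ("0/1", "1|0"). Sites that fail the check are only candidates for novel mutations, since genotyping errors produce the same pattern.

```python
import re
from itertools import product


def alleles(genotype):
    """Split a VCF GT string such as '0/1' or '1|0' into its alleles."""
    return re.split(r"[/|]", genotype)


def mendelian_consistent(child, father, mother):
    """Check one site in a trio.

    Returns True if the child's two alleles can be explained by one
    allele from each parent, False if not, and None if any genotype
    is missing or not diploid.
    """
    calls = [alleles(g) for g in (child, father, mother)]
    if any(len(c) != 2 or "." in c for c in calls):
        return None
    child_alleles, father_alleles, mother_alleles = calls
    # Try every pairing of one paternal and one maternal allele.
    return any(
        sorted((a, b)) == sorted(child_alleles)
        for a, b in product(father_alleles, mother_alleles)
    )


if __name__ == "__main__":
    print(mendelian_consistent("0/1", "0/0", "0/1"))  # inherited from mother
    print(mendelian_consistent("0/1", "0/0", "0/0"))  # not explained by parents
```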
Slide notes
Slide 4: this link explains recombination and inheritance REALLY well.
This John Hawks blog post is my source for the figure of 300-600 novel mutations per generation.
Slide 19:
You can read more about the Ashkenazi Jews here.
The data sets are available here.
Slide 27: Canavan Disease
Slides 30 and 31 from Demographic events and evolutionary forces shaping European genetic diversity by Veeramah and Novembre, 2014.
Slide 32 from Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls, 2007.
Slide 35: the "narcissome" link
Slide 36: a paper on lack of concordance amongst variant callers.
Slide 37: the gene drive link.
How to get involved
I asked the bcbio and gemini folk if there were any opportunities for Python folk to get involved in their work.
Here are some of their thoughts:
Fixing the slowness of bigwig parsing in bx-python would be a great project. See my last comment here, which pinpoints the bottleneck:
https://bitbucket.org/james_taylor/bx-python/issue/38/read-a-bigwig-file-is-slow
Another good "open" project is the SQLite to PostgreSQL conversion, to help provide improved speed for larger input files. There is some in-progress work from Aaron Quinlan on this branch:
From the bcbio side, the biggest help we could use from non-biology technical folks is improving the use and cleanliness of the Cloud port:
https://bcbio-nextgen.readthedocs.org/en/latest/contents/cloud.html
and in moving to use the common workflow language (CWL) as a backend for running computations:
https://github.com/common-workflow-language/common-workflow-language
For folks with workflow/distributed experience, there is a reference implementation in Python that needs extension and parallelization:
https://github.com/common-workflow-language/common-workflow-language/tree/master/reference
Links from during the talk
On sequencing, and nanopore
A few people asked about sequencing tech; this is a pretty good intro to the latest thing, nanopore sequencing with Oxford Nanopore.
On data science in academia
Definitely read Jake Vanderplas on the Big Data Brain Drain and Hacking Academia.
A few people were interested in my talk, How to get tenure (while being open), which I gave last week for the Right to Research Coalition (talk announcement link).
--titus