Here are talk notes and links for my PyCon 2015 talk.
The talk slides are up on SlideShare.
You should definitely check out Mike Lin's great "Blogging my genome" series of blog posts.
My introduction to bcbio came from Brad Chapman's excellent blog post on evaluating and comparing variant detection methods.
There are several openly available benchmarking data sets for human genetics/genomics. The Ashkenazim data set I used for my talk is here, and you can see the Personal Genome Project profile for the son here. The raw data is available here, and you can see the resequencing report for the son here.
The Personal Genome Project is something worth checking out.
More and more of human genetics and genomics is "open" -- check out the Variant Call Format (VCF) spec, now on github.
To run the bcbio variant calling pipeline I discuss in the talk, take a look at my pipeline notes. The Gemini part of the notes lets you examine SNPs for the three individuals in the Ashkenazim trio, starting from the VCF files.
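If you want a feel for what a trio VCF holds before installing anything, here is a minimal sketch in plain Python. It is not how bcbio or Gemini work internally. The sample names, positions and genotypes are made up for illustration, and the helper names (`parse_genotypes`, `child_only_sites`) are my own. It relies only on the column layout described in the VCF spec: eight fixed columns, then FORMAT, then one column per sample.

```python
import io

# Hypothetical three-sample VCF snippet (made-up positions and genotypes).
EXAMPLE_VCF = """\
##fileformat=VCFv4.2
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tfather\tmother\tson
chr1\t1000\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\t0/0\t0/1
chr1\t2000\t.\tC\tT\t60\tPASS\t.\tGT\t0/0\t0/0\t0/1
"""


def parse_genotypes(handle):
    """Yield (chrom, pos, ref, alt, {sample: genotype}) for each record."""
    samples = []
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            continue
        chrom, pos, _id, ref, alt = fields[:5]
        # FORMAT lists the per-sample keys; find where GT sits.
        gt_index = fields[8].split(":").index("GT")
        genotypes = {
            sample: column.split(":")[gt_index]
            for sample, column in zip(samples, fields[9:])
        }
        yield chrom, int(pos), ref, alt, genotypes


def alleles(genotype):
    """Return the set of allele indices in a genotype such as 0/1 or 0|1."""
    return set(genotype.replace("|", "/").split("/"))


def child_only_sites(records, child, parents):
    """Yield (chrom, pos) where the child carries an allele neither parent has."""
    for chrom, pos, _ref, _alt, genotypes in records:
        parental = set().union(*(alleles(genotypes[p]) for p in parents))
        novel = alleles(genotypes[child]) - parental - {"."}  # ignore missing calls
        if novel:
            yield chrom, pos


records = list(parse_genotypes(io.StringIO(EXAMPLE_VCF)))
print(list(child_only_sites(records, "son", ["father", "mother"])))
```

Sites flagged this way are only candidates. On real data you would need to filter on call quality and depth before reading anything into them, which is exactly the kind of query Gemini is built for.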
Slide 4: this link explains recombination and inheritance REALLY well.
This blog post by John Hawks is my source for the figure of 300-600 novel mutations per generation.
You can read more about the Ashkenazi Jews here.
The data sets are available here.
Slide 27: Canavan Disease
Slides 30 and 31 are from "Demographic events and evolutionary forces shaping European genetic diversity" by Veeramah and Novembre, 2014.
Slide 35: the "narcissome" link
Slide 36: a paper on lack of concordance amongst variant callers.
Slide 37: the gene drive link.
How to get involved
Here are some of their thoughts:
Fixing the slowness of bigwig parsing in bx-python would be a great project. See my last comment here, which pinpoints the bottleneck:
Another good "open" project is the SQLite to PostgreSQL conversion, to help provide improved speed for larger input files. There is some in-progress work from Aaron Quinlan on this branch:
From the bcbio side, the biggest help we could use from non-biology technical folks is improving the use and cleanliness of the Cloud port:
and in moving to use the common workflow language (CWL) as a backend for running computations:
For folks with workflow/distributed experience, there is a reference implementation in Python that needs extension and parallelization: