Sun, 30 Aug 2009
Why does my iPhone know how to spell Cthulhu?
Very odd. I mean, it's nice to have my prayers spell-checked and all, but really, Apple? Cthulhu?
Also, jinja2 rocks. I think I'll be teaching it as a templating language this term...
And finally, people interested in using sqlite3 for shelve-like storage in Python 2.x can take a look at issue 52 in pygr's issue tracker; I've taken the code from bugs.python.org/issue3783 and "backported" it to Python 2.x. Since bsddb is no longer going to be part of the Python stdlib, we're planning to switch to using sqlite3 for scalable data storage.
--titus
posted at: 10:58 | path: /aug-09 | 10 comments
Thu, 27 Aug 2009
Teaching girls to program in Ruby
Sarah Mei posts about teaching Ruby to high school girls. Good stuff.
While searching for some GHOP info from way back, I ran across this post asking "where are the girls among the GHOP winners?" (The statistics mentioned in the post may have been posted since, although I haven't seen them.) We asked the Python mentors to "rate" the students, and the hands-down winner was someone who had worked closely with several different mentors and performed very well. Perhaps next time we should highlight everyone who did well; there were several women in the group, too.
In general it's tough to raise the visibility of minority groups, though. Do we engage in affirmative action of some sort, and if so, how do we do so without being unfair to others? Or do we simply rank people in a presumably gender- and color-blind way and see what happens? I've talked several times in various venues about trying to run a female-oriented GSoC derivative, like GNOME's Women's Summer Outreach Program, which would at least call attention to one minority in OSS... and if GHOP ever happens again, we could work on getting younger women involved.
--titus
posted at: 20:02 | path: /aug-09 | 2 comments
Tue, 25 Aug 2009
XML sucks for big data, or hadn't you noticed?
Courtesy of Rich Enbody, this blog post, How XML Threatens Big Data -- Dataspora, elicited a big "duh" from me.
You don't solve any of the semantic problems with data by elaborating on a textual format. You may bring them into the light, but along with the visibility comes "bureaucracy" -- technology, acronyms, proponents and opponents, and the usual cruft.
I find the "embrace lazy data modeling" rule rather funny, personally, because it is the data-world's counterpart to agile methodologies in software development: solve problems you actually have, rather than all the potential ones you see.
I do like the "15 minute" rule: if I can't parse some useful information out of your data format in 15 minutes, you've done something very wrong.
--titus
posted at: 20:40 | path: /aug-09 | 2 comments
Who belongs in the PSF?
So, it's nomination season for the Python Software Foundation again... and I have this niggling feeling that I'm forgetting about several people that have demonstrated significant commitment to the Python community, are good 'uns, and are otherwise people I would trust with some part of the future of Python and the community.
PSF membership "worthiness" is hard to define, but (for me) depends on more than just working in your own small patch of Python -- contributing broadly to GSoC/GHOP, for example, seems like a good criterion. Remember, people need a substantial positive vote by the existing PSF, too, so they should be Known. For some value of Known.
Any thoughts? Drop me a line.
--titus
posted at: 09:14 | path: /aug-09 | 0 comments
Wed, 19 Aug 2009
Calculating necessary coverage for ChIP-seq
OK, so you have a genome -- let's say it's about 1gb in size -- and you want to do ChIP-seq on a transcription factor that you think binds ~1000 places in the genome. You've measured the specificity of the transcription factor and it seems to enrich about 10-fold over background (an OK but not fantastic number). How much sequencing do you need to do to see a statistically significant signal?
We need two other numbers. First, what fragment size are you going to use? And second, what level of signal over background do you want to see?
Let's choose a fragment size of 300 bp and look for a 10:1 signal over background, just for grins.
The math then works out like this: you need roughly 1 sample from each fragment in the ChIP mixture (background + specific fragments) to get an average 10x signal with a 10-fold enrichment. So, if the background is 1 gb / 300 bp, or 3 x 10^6 fragments, and the signal is 10x enrichment x 1000 locations, or 10^4, then you need N=(3x10^6 + 10^4) samples to hit each background fragment once & each "real" location 10 times. Note that (3x10^6 + 10^4) is approximately 3x10^6: that is, the background dominates the necessary sequencing for transcription factors that bind in so few places in the genome.
If you want at least one sample (on average) from each of 3x10^6 fragments, you want 3x10^6 samples. A single Solexa lane yields approx 2.5x10^6 mappable reads (as of the last data sets I have -- so it should have improved by now), suggesting that a single Solexa lane should yield nearly enough samples to see a <deep breath> 10x signal with a 10:1 enrichment over background by ChIP in a 1 gb genome with a 300 bp fragment size.
Now, are these realistic numbers? In some cases, yes; in others, I don't know, but I think so. Some factoids and guesstimates:
the chick genome is 1.2 gb in size, or approximately 1 gb.
300 bp is a "typical" choice for fragment size, although of course you'd prefer smaller (for better resolution).
a 10x signal should stand out over background, using an off-the-cuff estimate of variance (sqrt(10) ~ 3, so 95% of your true peaks should have > (10 - 2var == 4) reads associated with them).
Of course, you're going to have a lot of background, so there will be many peaks that are just coincidental. I'm not sure how to do that napkin-sized calculation -- should I just look three standard deviations out (1 +/- 3 = 4) and see how many peaks I'd expect in that interval? It should be less than 1 percent, but that's still an awful lot of peaks when you're considering a background of 3x10^6 fragments.
People (TM) tell me that 10:1 enrichment is not atypical of an OK antibody.
So, assuming these numbers are about right, where can you optimize most easily? There aren't any non-linear influences, so you can't look to tweak there; it comes down to what's easy and cheap.
Mo' sequencing = mo' bettah, obviously, but each doubling of signal also doubles your cost.
Increasing your fragment size is cheap and divides out linearly -- a doubling from a 300 to a 600 bp fragment size will give you twice as much signal. Unfortunately it will also decrease your resolution, which will in turn have a significant impact on any bioinformatics you might do for finding binding sites.
But you could always go back and confirm your predictions with ChIP-QPCR, which most people do anyway.
You could start with a better antibody, too ;). Of course, usually that's not so easy to obtain.
I seem to recall that the Johnson et al. paper (pubmed 17540862) used an NRSF antibody that was estimated to yield 100x enrichment. Obviously that would significantly help with your signal-to-noise ratio!
I guess I'd look to changing the fragment size first, hmm. Wonder what kind of effect that has on the bioinformatics? I'll have to think about that more.
I'd appreciate any comments or thoughts people might have... I'm not even sure this is the right approach to calculating the numbers, but it makes sense to me! Please comment, or drop me a note at ctb@msu.edu.
--titus
p.s. Yes, background is not uniform. But I have yet to see a good method for calculating it; most people simply do a negative control and sequence that, too.
posted at: 10:05 | path: /aug-09 | 0 comments