Writing applications around workflow systems, take 2.
Note: Turns out Nick Loman is a C programmer. Well, that's what happens when I make assumptions, folks ;).
Jared Simpson just posted a great blog entry on nanopolish, an HMM-based consensus caller for Oxford Nanopore data. In it he describes how he moved from a Python prototype to a standalone …
I participated in my second Balti and Bioinformatics on Wednesday - unlike the first one, which ended with only slightly sketchy Indian food in Birmingham, this one was entirely online. The technology worked really well and I think this is a great way to do talks!
For those that haven't seen …
The fifth annual Analyzing Next Generation Sequencing Data workshop just finished - #ngs2014. As usual the schedule and all of the materials are openly available.
tl;dr? Good stuff.
We've been running this thing since 2010, and we now have almost 120 alumni (5 classes of roughly 24 students each). The …
These are the talk notes for my opening talk at the 2014 Bioinformatics Open Source Conference.
Normally my talk notes aren't quite so extensive, but for some reason I thought it would be a good idea to give an "interesting" talk, so my talk title was "A History of Bioinformatics …
Note: this post is a guest post by Jerome Kelleher. Please also see his letter to Bioinformatics on this topic.
(with Adina Howe, James Tiedje, Titus Brown)
I have been working on the assembly of big shotgun metagenomic data from the ARMO (Amazon Rain Forest Microbial Observatory) project. The biggest challenge is the huge data size: 2TB of fastq and more than 6 billion reads after read trimming. One lucky thing …
Or, "can we crowdsource BGI?" ;)
With all of the crazy need surrounding genomic analysis -- most of it on a shoestring budget -- I am thinking about a mildly crazy idea.
What if I offered to computationally analyze people's non-model transcriptomic and metagenomic data for them, in exchange for (a) non-exclusive access …
At our 2012 course on Analyzing Next-Generation Sequencing Data, we talked quite a bit about future sequencing technologies, as well as about what analyses are reasonably cookbook (and which ones aren't).
Here are my thoughts -- yours welcome!
The basic conclusions about sequencing tech were these:
Brad Chapman (@chapmanb on twitter) wrote and signed a nice review of my submission to the Bioinformatics Open Source Conference. In his review, he said
My only small suggestion is to include some discussion about your reproducibility work during the talk: the Amazon AMI, documentation and reproducible ipython workflows. This …
I'm a pretty big advocate of anything open -- open source, open access, and open science, in particular. I always have been. And now that I'm a professor, I've been trying to figure out how to actually practice open science effectively.
What is open science? Well, I think of it as …
I'm pretty proud of our most recently posted paper, which is on a sequence analysis concept we call digital normalization. I think the paper is pretty kick-ass, but so is the way in which we're approaching replication. This blog post is about the latter.
(Quick note re "replication" vs "reproduction …
I'm putting together a computational pipeline for a paper - a Makefile that runs a ton of stuff and outputs files, combined with an ipython notebook file that takes those output files and turns them into figures for inclusion in a LaTeX file. (Yes, very 2000, except for the ipython notebook …
If you're like me, we pretend to care about the science in bioinformatics software. But what we really do is try to find reasons not to outright loathe the software -- because, lud knows, there are usually plenty of reasons to hate it.
In no particular order, here are the top …
(updated to point to http://arxiv.org/).
Authors: Jason Pell, Arend Hintze, Rosangela Canino-Koning, Adina Howe, James M. Tiedje, C. Titus Brown
Abstract:
The memory requirements for de novo assembly of short-read shotgun sequencing data from complex microbial populations are an increasingly large practical barrier to environmental studies. Here we …
I'm writing this on my way back from Stockholm, where I attended a workshop on the 4th Paradigm. This is the idea (so named by Jim Gray, I gather?) that data-intensive science is a distinct paradigm from the first three paradigms of scientific investigation -- theory, experiment, and simulation. I was …
As sequencing gets cheaper and cheaper, one would expect the answer for how to best sequence (and assemble!) any given genome would change. Most biologists assume something along these lines: everyone else has achieved some standard coverage (say 10x, or 100x) for their genome, so all we need to do …
There's been a lot of hoopla in the last year or so about the fact that our ability to generate sequence has scaled faster than Moore's Law over the last few years, and the attendant challenges of scaling analysis capacity; see Figure 1a and 1b, this reddit discussion, and also …
The second iteration of our bioinformatics summer course, Analyzing Next-Generation Sequencing Data, just finished. It was a great success, at least judging from the comments that people made to us personally; the evaluations aren't yet complete.
The what: a two-week course on analyzing next-gen sequencing data, using the Amazon …
First, I write a recipe file, 'metagenome.recipe', laying out my job description for, say, sequence trimming and assembly with Velvet:
fasta_file soil-data.fa
qc_filter min_length=50 remove_Ns=true
graph_filter min_length=400
velvet_assemble k=33 min_length=1000 scaffolding=True
Then I specify …
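A recipe like the one above could be read into a list of steps with a few lines of Python. This is only a sketch, not the actual tool: it assumes one step per line, with the step name first, followed by positional arguments and `key=value` parameters. The function name `parse_recipe` is hypothetical, and parameter values are kept as plain strings.

```python
def parse_recipe(text):
    """Parse recipe text into a list of (step_name, args, params) tuples.

    Assumed layout (not confirmed by the post): one step per line,
    step name first, then positional args and key=value parameters.
    """
    steps = []
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        name, args, params = tokens[0], [], {}
        for token in tokens[1:]:
            if "=" in token:
                key, value = token.split("=", 1)
                params[key] = value  # values stay as strings
            else:
                args.append(token)
        steps.append((name, args, params))
    return steps


RECIPE = """\
fasta_file soil-data.fa
qc_filter min_length=50 remove_Ns=true
graph_filter min_length=400
velvet_assemble k=33 min_length=1000 scaffolding=True
"""

for name, args, params in parse_recipe(RECIPE):
    print(name, args, params)
```

Keeping the parsed steps as plain tuples makes it easy to hand each one to whatever function actually runs that stage.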
I just parachuted in on (and heli'd out of?) the Beyond the Genome conference in Boston. I gave a very brief workshop on using EC2 for sequence analysis, which seemed well received. (Mind you, virtually everything possible went wrong, from lack of good network access to lack of attendee computers …
(with Adina Howe, Jason Pell, Rosangela Canino-Koning, and Arend Hintze).
A few weeks ago I blogged a bit about a k-mer filtering system, khmer, that we were using to reduce metagenomic data to a more tractable size by throwing out error-prone reads (see A memory efficient way to remote …
Course page at: http://ged.msu.edu/courses/2010-fall-cse-891/:
This course will introduce biologists to computational thinking, practical computational techniques, and research topics in computational evolution. The course will consist of three intensive hands-on 5-week modules: computational competence in UNIX; data mining and hypothesis generation using the Avida digital life …
(This project is a collaboration with Jason Pell and Adina Howe)
A few weeks ago I posted about a k-mer filtering approach that we were using to remove low-abundance k-mers from metagenomic data sets, prior to assembly. This technique is working well, and we've managed to do some assembly of …
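To make the idea of abundance-based k-mer filtering concrete, here is a minimal in-memory sketch. It is not the implementation used in this project: it counts k-mers exactly with a `Counter`, which would not scale to real metagenomic data, and it interprets "filtering" as dropping any read that contains a k-mer seen fewer than `min_abundance` times. The function names and that interpretation are assumptions made for illustration.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every k-length substring of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def count_kmers(reads, k):
    """Count how often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def filter_reads(reads, k, min_abundance):
    """Keep reads whose k-mers all occur at least min_abundance times.

    Reads shorter than k contain no k-mers and are dropped.
    """
    counts = count_kmers(reads, k)
    return [
        read for read in reads
        if len(read) >= k
        and all(counts[km] >= min_abundance for km in kmers(read, k))
    ]


reads = ["ACGTAC", "ACGTAC", "TTTTGG"]
print(filter_reads(reads, k=4, min_abundance=2))
```

In this toy input the third read is discarded because each of its 4-mers appears only once.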
The Terabase Metagenomics meeting was good fun, but I most valued the computational component (because that's what I do). Rachel Mackelprang and Rob Knight and I wrote down a list of the computational issues involved in a petabase metagenomics project, and that list will help direct my future research. I'll …
I'm on my way back from the Terabase Metagenomics meeting in Snowbird, UT, and I'm buzzing with ideas about how to move forward in metagenomics and bioinformatics research. Metagenomics, the use of genomics approaches to study microbial communities, has been opening up as sequencing drops in price. With sequencing becoming …
I've spent the last few weeks working on a simple solution to a challenging problem in DNA sequence assembly, and I think we've got a nice simple theoretical solution with an actual implementation. I'd be interested in comments!
Briefly, the algorithmic challenge is this:
We have a bunch of …
After my recent next-gen sequencing course, which was supposed to tie into the whole software carpentry (SWC) effort but didn't really succeed in doing so the first time through, I started thinking about the Right Way to tie in the SWC material. In particular, how do you both motivate scientists …
Our sequencing analysis course ended last Friday, with an overwhelmingly positive response from the students. The few negative comments that I got were largely about organizational issues, and could be reshaped as suggestions for next time rather than as condemnations of this year's course.
The 23 students -- most with no …
So, I've been teaching a course on next-generation sequence analysis for the last week, and one of the issues I had to deal with before I proposed the course was how to deal with the volume of data and the required computation.
You see, next-generation sequence analysis involves analyzing not …
So, I'm running this summer course and I am trying to figure out how to organize the notes for students. I'd like to mix curriculum-specific notes ("here's what we're doing today, and here are some problems to work on") with tutorials (material independent of a single course, like "here's how …
In conversation with a colleague the other day, I found myself making a surprising prediction: the age of the big sequencing centers (Broad Institute, WUSTL, Baylor, DOE JGI, etc.) is coming to an end. In 5 years they will no longer exist.
This prediction is obvious in hindsight.
That is …
I've been doing some more focused bioinformatics programming recently, and as I'm thinking about how to teach biologists about data analysis, I realize more and more how much backstory goes into even relatively simple programming.
The problem: given a reference genome, and a very large set of short, error-prone, random …
These days, molecular biologists are dealing with lots and lots of sequences, largely due to next-gen sequencing technologies. For example, the Illumina GA2 is producing 100-200 million DNA sequences, each of 75-125 bases, per run; that works out to 20 gb of sequence data per run, not counting metadata such …
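The per-run figure can be checked with a quick back-of-the-envelope calculation. The sketch below multiplies the read counts and read lengths quoted above, assuming "gb" means gigabases (10^9 bases); the function name is made up for this example.

```python
def gigabases_per_run(n_reads, read_length):
    """Total sequenced bases for one run, in units of 10^9 bases."""
    return n_reads * read_length / 1e9


# Low end: 100 million reads of 75 bases; high end: 200 million reads of 125 bases.
low = gigabases_per_run(100_000_000, 75)
high = gigabases_per_run(200_000_000, 125)
print(f"{low:.1f} to {high:.1f} gigabases per run")
```

The quoted figure of roughly 20 gb falls inside this range, toward the upper end.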
Analyzing Next-Generation Sequencing Data
May 31 - June 11th, 2010
Kellogg Biological Station, Michigan State University
CSE 891 s431 / MMG 890 s433, 2 cr
Applications are due by midnight EST, April 9th, 2010.
Course sponsor: Gene Expression in Disease and Development Focus Group at Michigan State University.
Instructors: Dr. C. Titus …
OK, so you have a genome -- let's say it's about 1gb in size -- and you want to do ChIP-seq on a transcription factor that you think binds ~1000 places in the genome. You've measured the specificity of the transcription factor and it seems to enrich about 10-fold over background (an …
Just submitted this on Thursday:
Next generation sequencers are beginning to impact agricultural biology. Over the next few years, next generation sequencing will produce incredibly large datasets that will address structural (e.g., SNPs, CNVs, indels, methylation, translocations) and functional (e.g., RNA expression, transcription factor binding sites) variation in …
The decision of python-dev to deprecate bsddb has left us in a bit of a pickle (hah!) over in the pygr project. We're looking for a replacement for bsddb for default storage of infrequently- (or never-) changed pickled Python objects. Some of the parameters under consideration are:
- Python version availability …
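For a sense of what a minimal store for pickled objects looks like, here is a sketch built on the standard library's `sqlite3` and `pickle` modules. It is offered purely as an illustration of the requirement (key-based storage of rarely changed pickled objects), not as the replacement pygr considered or chose; the class name `PickleStore` and the table layout are hypothetical.

```python
import pickle
import sqlite3


class PickleStore:
    """Tiny dict-like store mapping string keys to pickled Python objects."""

    def __init__(self, path=":memory:"):
        # ":memory:" keeps the example self-contained; pass a filename to persist.
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS objects (key TEXT PRIMARY KEY, value BLOB)"
        )

    def __setitem__(self, key, obj):
        blob = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(
                "INSERT OR REPLACE INTO objects (key, value) VALUES (?, ?)",
                (key, blob),
            )

    def __getitem__(self, key):
        row = self.conn.execute(
            "SELECT value FROM objects WHERE key = ?", (key,)
        ).fetchone()
        if row is None:
            raise KeyError(key)
        return pickle.loads(row[0])


store = PickleStore()
store["annotation"] = {"gene": "example", "exons": [(10, 50), (80, 120)]}
print(store["annotation"])
```

As with any pickle-based storage, this should only be used to load data from sources you trust.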
The latest hot shit idea for making a protein-protein interaction database leaves me lukewarm.
A few months ago I met with a genomics group, and we had a back-and-forth about genome annotation. The conversation went something like this:
them: "We have to improve the tools for annotating un-annotated genes!" me …
My last post initiated a discussion on the biology-in-python mailing list about BioPython, among other things. (Here is a link to the discussion, which is kind of long and unfocused.)
I'm happy that the bip list is serving as a place for people to interact with the BioPython maintainers to …
Chris Lasher wrote a nice blog post naming me as a rabble rouser in the area of "Python in bioinformatics". His post raised a number of interesting points, some of which I'd like to discuss here on my blog.
First, why is Python not more dominant in bioinformatics? I really …
We have an opening for a project on which I'm collaborating:
Full-time 12 month appointment academic position for a genomics scientist. The incumbent will spend 50% time as the Associate Director of the Comparative Genomics Laboratory, with duties in directing daily activities, long-range planning and seeking extramural funding, and 50 …
I read things like this report on SciFoo and think, gawd! I'd have had a great time! I should try to beg/bully/buy/brown-nose my way into the next SciFoo so I can talk about Science 2.0 etc.!
And then I think back to the heady days of …
I finally got sick of manually schlepping BLAST files around, so I wrote something to do it for me. 'zounds' is a very simple server/client system for coordinating a bunch of 'worker' nodes through a central server; it does everything in Python with objects and pickling, so it's easy …
(pygr is a neat bioinformatics framework in Python.)
After some commenters on my last post seemed happy to hear that pygr was the focus of some summer work, I realized I had only discussed the pygr summer work in a post to the biology-in-python list.
Whoops.
So, here's the scoop …
Dear Lazyweb, help!
I'm embarking on a number of summer projects in my new lab at MSU, and several of them focus on using pygr to do cool genomic stuff. In particular, I'm planning to build a personal genome annotation system that will let people run their own full genome …
I spent some time over the last week adding fairly simple motif searching to Cartwheel, my bioinformatics site for biologists doing cis-regulatory analysis of genomic sequence. The new features include the ability to define and search with IUPAC and position-weight matrix (PWM) motifs, as well as visualization of motif search …
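As a rough illustration of what searching with an IUPAC motif involves, the sketch below translates IUPAC ambiguity codes into a regular expression and reports every match position. It is not Cartwheel's code: the function names are invented for this example, only the forward strand is searched, and PWM scoring is left out.

```python
import re

# Standard IUPAC nucleotide ambiguity codes, written as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}


def motif_to_regex(motif):
    """Translate an IUPAC motif such as 'TGCR' into a regex string."""
    return "".join(IUPAC[base] for base in motif.upper())


def find_motif(sequence, motif):
    """Return 0-based start positions of all (possibly overlapping) matches."""
    # A lookahead lets the regex engine report overlapping occurrences.
    pattern = re.compile("(?=(" + motif_to_regex(motif) + "))")
    return [match.start() for match in pattern.finditer(sequence.upper())]


print(find_motif("ACGTTGCATGCA", "TGCR"))
```

Searching the reverse strand would mean running the same function on the reverse complement of the sequence.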
Via http://www.nodalpoint.org/2008/01/18/one_thousand_databases_high_and_rising, on the Nucleic Acids Res "database" issue:
As we pass the one thousand databases mark (1kDB) I wonder, what proportion of the data in these databases will never be used?
This is an unsettling thought for …
I just finished a chapter for a book, Methods in Avian Embryology, being edited by my boss, Marianne Bronner-Fraser. This chapter is intended for developmental biologists who are interested in locating regulatory modules and analyzing them for binding sites. It ended up being my outlet for a compilation of problems …
My Computer Science department at Michigan State University is looking for an assistant professor! We are casting a fairly wide net (databases, graphics, medical imaging, and bioinformatics) but I'd really like to attract a bioinformatician.
The Computer Science department at MSU is a nice, small department …
So, next May I'm starting as an assistant professor split between the Computer Science and Microbiology and Molecular Genetics departments at Michigan State U., and I'm interested in attracting as many good CS grad applicants as I can from the open source and bioinformatics communities. (I would also like to …
Rob Campbell found me by google, and pointed me towards his blog, Science and Software. Funny, well written, and very apropos! Why isn't there more software, commercial or otherwise, for labs?
There has been a lot of local interest (i.e. two or three people have discussed it at various …
After our long software licensing discussion on the biology-in-python list, I realized that I wanted something different in a license for scientific software.
Specifically, I would like to attach the following clause to either a BSD or L/GPL style license:
Publications relying on derivative works of this software must …
This month the newly minted biology-in-python mailing list erupted into a discussion of licenses. There was some confusion about the goal of the discussion, for which I'm largely responsible: we didn't make it clear that we were talking about licenses for code and content posted on the bio.scipy.org …
In the spirit of cleaning up my desktop... here's a PDF of my talk on Cartwheel at SciPy 2007.
--titus
I'm now listed on the Gene Expression in Disease and Development page, as well as on the CompSci faculty page, MicroMolecularGenetics faculty page, QuantBio page, and SysBio page.
It was quite a shock to log into the CompSci cluster at MSU and see my group set as "faculty". As a …
It's been a busy few weeks, in part because I've been writing a grant. Last Thursday, I submitted a grant proposal to NIH for their program announcement, Continued Development and Maintenance of Software. The proposal was to continue maintaining Cartwheel, while integrating a new visualization frontend (MUSSA) and a fast …
So, I "organized" a Biology Birds of a Feather at SciPy 2007. This mainly consisted of posting about it and then trying to write stuff on a white board while keeping abreast of the conversation. About 15 people attended.
I didn't get everyone's name and in any case I don't …
To get people talking, I've created a "biology-in-python" mailing list. You can subscribe here: http://lists.idyll.org/listinfo/biology-in-python, and you can post to it at bip@lists.idyll.org once you're a member.
This list is a tool/package/library-agnostic list, for people who use Python to work …
corebio, the joint effort by a junta of California bioinformaticians to replace BioPython with something we like better, is proceeding interestingly. So far we have discussed the following issues:
- what license? (BSD)
- what focus? (sequence manipulation & parsing)
- what about binary extensions? (focus on API, provide fast implementations where appropriate, but …
I've said some mean things about BioPython in the past -- that it's broken, that it's crufty, etc. One prominent former BioPython developer responded with the very reasonable question of why I wasn't fixing it, if it was so broken. The answer, of course, is that I've been working on my …