Mon, 30 Apr 2007

A new BLAST parser


I spent the weekend hacking out a BLAST parsing package with pyparsing.

BLAST is a really common bioinformatics tool used to search large-ish sequence databases, and the NCBI BLAST program is probably the single most heavily used program in bioinformatics by a long shot. Unfortunately, the NCBI folk have a habit of making tools with idiosyncratic output formats, and AFAIK the only way to obtain all of the information calculated by BLAST is to parse the (human-readable) text format.

This text format is not only human-readable (and not very machine-readable) but it changes fairly regularly, breaking parsers in packages like BioPython. Since I'm already using pyparsing in twill, and I appreciate its very nice syntax, I decided to try writing a maintainable BLAST parser with pyparsing. (The other primary goals were to build a nice Pythonic API and to simplify the use of introspection.)

It took me a long time (all weekend!) to do so, but I've finally got a nice, simple API and what seems to be a largely functioning parser:

for record in parse_file('blast_output.txt'):
   print '-', record.query_name
   for hit in record.hits:
      print '--', hit.subject_name, hit.subject_length
      for submatch in hit.matches:
         print submatch.expect, submatch.bits

         alignment = submatch.alignment
         print alignment.query_sequence
         print alignment.alignment
         print alignment.subject_sequence

It's not really ready for unsupervised use yet, but if anyone out there is jonesin' for a BLAST parser and wants to try this one out, please let me know via e-mail and I'll send it your way. I'd appreciate comments.

--titus

posted at: 07:57 | path: /apr-07 | 4 comments

Tags: ,