I spent the weekend hacking out a BLAST parsing package with
BLAST is a really common bioinformatics tool used to search large-ish
sequence databases, and the NCBI BLAST program is probably the single
most heavily used program in bioinformatics by a long shot.
Unfortunately, the NCBI folk have a habit of making tools with
idiosyncratic output formats, and AFAIK the only way to obtain all
of the information calculated by BLAST is to parse the
(human-readable) text format.
This text format is not only human-readable (and not very
machine-readable) but it changes fairly regularly, breaking parsers in
packages like BioPython. Since I'm already using pyparsing in twill,
and I appreciate its very nice syntax, I decided to try writing a
maintainable BLAST parser with pyparsing. (The other primary goals
were to build a nice Pythonic API and to simplify the use of
It took me a long time (all weekend!) to do so, but I've finally got a
nice, simple API and what seems to be a largely functioning parser:
for record in parse_file('blast_output.txt'):
print '-', record.query_name
for hit in record.hits:
print '--', hit.subject_name, hit.subject_length
for submatch in hit.matches:
print submatch.expect, submatch.bits
alignment = submatch.alignment
It's not really ready for unsupervised use yet, but if anyone out there
is jonesin' for a BLAST parser and wants to try this one out, please
let me know via e-mail and I'll send it your way.
I'd appreciate comments.
Posted by Andrew Dalke on 2007-05-01 at 04:06.
I wrote Martel for Biopython for pretty much the same reasons. That
was around 2000 or so. I used Greg Ewing's PLEX syntax as my starter,
and both have the same style as PyParsing. I have several
conclusions from my Martel work. To start, my interest in data
validation (being very strict about the output) does not match up with
the general desire to be mostly right, and skip/ignore outliers.
Consider this SWISS-PROT record from HAS2_CHICK """DT 30-MAY-2000
(REL. 39, Created)"" That "REL" should be "Rel", according to the
format spec. Everything else is "Rel", and I wrote my parser to only
handle "Rel". When doing my validations I had to weaken the grammar
to support wild-type records. For validation, strictness is useful.
That record was in the wrong format. But in general use, nearly
nobody cares. (Another problem in SWISS-PROT, years ago, had a
bogus DR line in the comment/CC section. Again, I had to special case
that one record.) Now consider the case of a format family. For
example, the SWISS-PROT formats over time. In that case you only need
to handle the most recent version of the format, since they do a
pretty good job of being backwards compatible. But not always. The
BLAST output over time is not backwards compatible with previous
versions. The number of spaces, and newlines changes, the layout of
the output changes, etc. How then do you write a grammar which
handles all of those variations? For example, do you write a single
grammar for blastn, blastp, tblastx, etc.? Or do you write different
grammars for each, hopefully sharing the definitions for the common
parts. Martel supported the latter. I wanted to say "blast_format
= blastn_format | blastp_format | tblastx_format | ...", and that
works. It would work with pyparsing as well. Now suppose the
format changes. Perhaps an older version used a line containing "
\n" in two different places where the new version uses "\n".
(Something like this happened in 2.0.14 or so.) There are two ways
to handle this. One is to write an alternation point at the lowest
level - "this line can either be " \n" or "\n" (or matches \s+\n).
Pro: handles both cases. Another is to write a new format definition
which supports only the new format. Pro: better at validation. The
first solution ends up allowing other BLAST-like formats to be parsed,
in this case if the first alternation has " \n" while the second only
"\n" - not legal in either format. The second ends up being hard to
implement. I think it's done with a tree grammar: "format Y is format
X except replace this node in the tree with my new one". Mmm, this
comment is getting too lengthy. I'll summarize what I just wrote as:
the hard part about using a language grammar for BLAST is supporting
all of the variations of BLAST and its changes through time. I don't
think pyparsing/martel/flex are quite right for this problem. Jeff
Chang started to build a collection of BLAST output over time showing
many different corner cases. Don't think that every went anywhere.
Still, you should be able to mine biopython and bioperl test cases for
variations of BLAST output in formats you might want to handle.
Depends how far back in time you want to support. Nice not having
legacy users :)
Posted by Antonio Cavallo on 2007-05-01 at 04:18.
This all "parsing" has got me out of the Bioinformatics field a long
time ago: at that time I failed to see any science in it.
Posted by Titus Brown on 2007-05-01 at 14:02.
Antonio, yes, it's always nice when you don't have to do the silly
work in order to do the interesting work. Still, here we are...
Andrew, I see your points, and I agree with most of them. I'm
focusing on automated tests and maintainability -- or at least I'd
like to ;) -- and I'm hoping that this will translate into a more
effective parsing situation. Down the road I suspect that I will be
rewriting/refactoring BLAST, but that's a different issue ;)
Posted by Paul McGuire on 2007-05-02 at 11:13.
Some of these vagaries of chance formats are automatically
accommodated by pyparsing's credo, "whitespace doesn't matter".
Adding or removing an extra line of whitespace is completely
transparent to pyparsing's grammar matching logic. Other changes can
be anticipated, and coded defensively. For instance, if a separation
line of '====' signs might change in length (from 72 to 96 characters,
say), then specifying this element as Word('=') is more robust than
Literal('='*72). Other pyparsing robustness-enhancing techniques
are to make liberal use of results names, so that later introduction
of optional fields can be easily inserted into the grammar without
changing down all references to tokens by index - existing tokens are
consistently referenced by name. Lastly, when retrofitting format
changes for new versions, placeholder Forward's can be inserted into
the existing grammar, to be modified at parse time if the new version
is detected - sort of a self-modifying grammar. A parse action
attached to the version id field would insert the expression
definitions for the new version's specialized data into the
placeholder. So I'm hoping that this combination of adaptation
techniques and pyparsing's grammar readability will help Titus to have
a longer-term success in his BLAST processing. -- Paul
There are comments.