[Biopython-announce] blast parser

Arne Mueller a.mueller@icrf.icnet.uk
Mon, 13 Dec 1999 23:51:02 +0000


Hi Biopython people,

Version 1.1 of my Gap-BLAST/PSI-BLAST parser is available:

 http://www.bmm.icnet.uk/people/mueller/
 

As you probably know, there is a BLAST parser project from Jeffrey Chang
underway, which will probably be the 'official' BLAST parser for
Biopython in the future. Anyway, the parser I can offer is rather
specific (only Gap-BLAST or PSI-BLAST with standard options!) but ready
for download and testing. So anyone who is desperately looking for
something that reads standard BLAST output of peptide searches (like me)
can try it.

The main changes since version 1.0 are [snip from blast.py module doc
string]:

blast.py (the blast parser):

- Class Blast no longer takes an optional argument IterationClass used
to generate Iteration objects; likewise Iteration no longer takes a
HitClass and Hit no longer takes an HSP class to generate HSP objects.

- The blast object takes optional iteration, hit and HSP classes, which
are used to generate the appropriate objects.

- Added a drift check to the blast class. Sometimes PSI-BLAST loses hits
collected during the first iteration. The blast object class provides a
method to stop collecting hits before a drift is detected (e.g. when
hits of the first iteration get lost in iteration 4, parsing will be
aborted after iteration 4).

- Attributes hit.name and hit.db are removed. Only hit.id exists. That
makes parsing more flexible with respect to different database formats
(e.g. NCBI NRPROT).

- Change of ending parsing state: blast objects are persistent and can
be pickled.

- The number of hits in the summary block doesn't have to equal the
number of sequences in the alignment block (-v 0 -b 2000 is possible,
and vice versa).

- Parsing and storing of blast run information: blast header and footer
are stored in the blast object (size of database, length of query, etc.).

- Compilation of patterns for Tokens takes place inside the class and
not in object construction (__init__). That means the regular
expressions are compiled only once! Classes inheriting from these
classes can still change the tokens list and define their own patterns.

- Class Token now accepts a string or a precompiled re object as first
argument (necessary for implementation of the previous item).

- parseAlignments in class Hit is changed from a recursive to an
iterative implementation to avoid a large execution stack.
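
To illustrate the two Token items above, here is a minimal sketch in
present-day Python. It is not the actual blast.py code: the class
HitParser, the method classify, the token names and the patterns are all
hypothetical, chosen only to show patterns being compiled once at class
level and a Token that accepts either a string or a precompiled re
object.

```python
import re


class Token:
    """Hypothetical stand-in for the parser's Token class."""

    def __init__(self, pattern, name):
        # Accept either a pattern string or an already compiled regex.
        if isinstance(pattern, str):
            pattern = re.compile(pattern)
        self.pattern = pattern
        self.name = name

    def match(self, line):
        return self.pattern.match(line)


class HitParser:
    # Built once, when the class body is executed, and shared by every
    # instance (instead of being rebuilt in each __init__ call).
    tokens = [
        Token(re.compile(r"^>(\S+)"), "hit_id"),   # precompiled object
        Token(r"^\s*Length = (\d+)", "length"),    # plain string
    ]

    def classify(self, line):
        """Return (token name, captured text) for the first matching token."""
        for token in self.tokens:
            m = token.match(line)
            if m:
                return token.name, m.group(1)
        return None


class MyHitParser(HitParser):
    # A subclass can still replace the token list with its own patterns.
    tokens = [Token(r"^>gi\|(\d+)", "gi_number")] + HitParser.tokens
```

Because tokens is a class attribute, creating many parser objects does
not recompile any regular expression, while a subclass only has to
override the list.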

blastflt.py (example program using the parser):

- Command-line options: 'evalue' and 'pid' to exclude hits that do not
meet an e-value or percent identity cutoff; option 'driftchk' to
activate the drift filter in the blast parser with a certain e-value
cutoff.
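
The filtering and drift check described above could be sketched roughly
as follows in present-day Python. This is only an illustration under
assumptions, not the code in blastflt.py or blast.py: the Hit fields and
the functions passes and collect_until_drift are hypothetical, and I
assume that a hit is kept when its e-value is at most the cutoff and its
percent identity is at least the cutoff, and that "drift" means a
first-iteration hit within the e-value cutoff is missing from a later
iteration.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    id: str
    evalue: float
    pid: float  # percent identity


def passes(hit, max_evalue, min_pid):
    """Assumed filter rule: keep hits at or within both cutoffs."""
    return hit.evalue <= max_evalue and hit.pid >= min_pid


def collect_until_drift(iterations, drift_evalue):
    """Collect iterations until a first-iteration hit goes missing.

    iterations is a list of lists of Hit, one list per PSI-BLAST
    iteration. Returns (kept iterations, 1-based number of the iteration
    in which the drift was seen, or None if there was no drift).
    """
    if not iterations:
        return [], None
    # Reference set: confident hits of the first iteration.
    reference = {h.id for h in iterations[0] if h.evalue <= drift_evalue}
    kept = []
    for number, hits in enumerate(iterations, start=1):
        kept.append(hits)
        ids = {h.id for h in hits}
        if not reference <= ids:
            # A reference hit was lost: stop after this iteration.
            return kept, number
    return kept, None
```

With four iterations in which a reference hit disappears in the third,
the function returns the first three iterations and the number 3, which
mirrors the "aborted after iteration 4" example in the drift check item.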

I'd be happy to receive comments, bug reports and questions.

	thanks a lot,

	Arne

-- 
Arne Mueller
Biomolecular Modelling Laboratory
Imperial Cancer Research Fund
44 Lincoln's Inn Fields
London WC2A 3PX, U.K.
phone : +44-(0)171 2693405      | fax :+44-(0)171-269-3534
email : a.mueller@icrf.icnet.uk | http://www.icnet.uk/bmm/