[Biopython] Parsing large blast files

Tue Apr 28 13:00:07 UTC 2009

--- On Mon, 4/27/09, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> > Would NCBIStandalone.Iterator() be faster?
> 
> NCBIStandalone.Iterator() is the old semi-obsolete plain
> text parser - it won't parse the XML output, hence the
> "Invalid header" error.  Maybe the tutorial
> (or the error message) could be clearer.

I think part of the problem is the organization of the code in Bio.Blast, which seems to have grown piecemeal over time. Bio.Blast.NCBIStandalone contains blastall, blastpgp, and rpsblast, which makes sense, but it also contains BlastParser and PsiBlastParser, which are not necessarily connected to standalone Blast. Bio.Blast.ParseBlastTable contains the parser for blastpgp output. Bio.Blast.NCBIWWW contains qblast, but also the parser for Blast HTML output, even though qblast does not necessarily generate output in HTML format.

This module might be easier to understand if all functions were accessible directly from Bio.Blast, in a fashion more consistent with the rest of current Biopython. Bio.Blast would then have the following functions:

read(handle, format='xml')
parse(handle, format='xml')
blastall
blastpgp
rpsblast
qblast

with most of the actual code staying hidden in Bio.Blast.NCBIStandalone and the other existing submodules.

Any objections, comments?

--Michiel.