[Bioperl-l] BLASTXML changes (good this time!)

Chris Fields cjfields at uiuc.edu
Fri Feb 9 13:58:24 UTC 2007


On Feb 8, 2007, at 10:18 PM, Torsten Seemann wrote:

> Chris,
>
>> BLAST XML parsing should now work for any CPAN-based XML::SAX parser!
>> XML::SAX::PurePerl (comes with XML::SAX, the slowest)
>> XML::SAX::Expat
>> XML::SAX::ExpatXS (the fastest)
>> XML::LibXML::SAX
>> XML::LibXML::SAX::Parser
>
> That's excellent news - thanks for all the work you have put in on
> this one. I'm impressed.

Jason did most of the hard work; I just tinkered with it until it  
worked (and pestered a few perl XML guys along the way).  Thanks  
Grant and Björn!

> This is a good opportunity to encourage people who use Bio::SearchIO
> for BLAST parsing to switch to 'blastxml' format over 'blast'.
> Although the latter is more human readable, it perennially requires
> parser source changes to cope with the variations and new formatting
> introduced with each new NCBI BLAST release. Best to use "-m 7" XML
> format, and convert as appropriate using one of the
> Bio::Search::Writer:: classes.
>
> --Torsten

I'll try getting some benchmarks for the different parsers up today  
on the wiki if I have time.

Strangely enough, NCBI changed a few things about BLAST XML a few
releases back without mentioning it to anyone (this caused a silent bug
in BLAST XML parsing, which I fixed recently).  If you sent in multiple
queries in older versions of BLAST you would get all of the BLAST XML
reports concatenated together, which required preparsing the output
to carve up the XML prior to parsing.  Now they treat it like
PSI-BLAST, where multiple queries = multiple iterations, so you get one
long BLAST XML report in which each iteration corresponds to one Result.
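
To illustrate the "carve up the XML" step for the older concatenated
output, here is a minimal sketch in Python.  It is not the BioPerl
code; the function name split_concatenated_xml is hypothetical, and it
assumes every report in the file starts with its own <?xml ...?>
declaration.

```python
import re

# Matches the start of an XML declaration.
# Assumption: each concatenated report begins with one.
_XML_DECL = re.compile(r"<\?xml\b")


def split_concatenated_xml(text):
    """Return a list of standalone XML documents found in `text`.

    Anything before the first declaration is ignored.
    """
    starts = [m.start() for m in _XML_DECL.finditer(text)]
    if not starts:
        # No declaration at all: treat non-empty input as a single document.
        stripped = text.strip()
        return [stripped] if stripped else []
    # Each document runs from its declaration to the next one (or to EOF).
    ends = starts[1:] + [len(text)]
    return [text[s:e].strip() for s, e in zip(starts, ends)]
```

Each returned string can then be handed to an XML parser on its own.
Newer single-report output passes through unchanged as a one-element
list.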

The current parser should handle both, since it caches all the results
from a parse and returns them one at a time before starting a new
parse.  Because everything is cached, I wouldn't recommend parsing a
huge BLAST XML file with hundreds of queries, as you'll quickly run
out of memory!
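
For comparison, this is roughly what incremental handling of the newer
one-report-many-iterations layout looks like.  It is only a sketch in
Python using the standard library's ElementTree.iterparse, not how
Bio::SearchIO works; iter_queries is a hypothetical name, and the
element names (Iteration, Iteration_iter-num, Hit, Hit_id) are assumed
to follow the NCBI BLAST XML layout.

```python
import io
import xml.etree.ElementTree as ET


def iter_queries(source):
    """Yield (iteration number, list of hit ids) for each <Iteration>.

    `source` is a filename or a binary file-like object.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "Iteration":
            num = elem.findtext("Iteration_iter-num")
            hit_ids = [hit.findtext("Hit_id") for hit in elem.iter("Hit")]
            yield num, hit_ids
            # Free this iteration's subtree; the emptied element itself
            # stays attached to its parent, which is cheap.
            elem.clear()


if __name__ == "__main__":
    sample = (
        b"<BlastOutput><BlastOutput_iterations>"
        b"<Iteration><Iteration_iter-num>1</Iteration_iter-num>"
        b"<Iteration_hits><Hit><Hit_id>gi|1</Hit_id></Hit></Iteration_hits>"
        b"</Iteration>"
        b"</BlastOutput_iterations></BlastOutput>"
    )
    for num, hits in iter_queries(io.BytesIO(sample)):
        print(num, hits)
```

Only one iteration's hits are held at a time, so memory use depends on
the largest single query rather than on the number of queries.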

If they get Perl SAX2 up to date with Expat they'll eventually add  
parse_chunk() and pause_parse() for each parser.  Until then...

chris

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign






