[Bioperl-l] BLASTXML changes (good this time!)
Chris Fields
cjfields at uiuc.edu
Fri Feb 9 13:58:24 UTC 2007
On Feb 8, 2007, at 10:18 PM, Torsten Seemann wrote:
> Chris,
>
>> BLAST XML parsing should now work for any CPAN-based XML::SAX parser!
>> XML::SAX::PurePerl (comes with XML::SAX, the slowest)
>> XML::SAX::Expat
>> XML::SAX::ExpatXS (the fastest)
>> XML::LibXML::SAX
>> XML::LibXML::SAX::Parser
>
> That's excellent news - thanks for all the work you have put in on
> this one. I'm impressed.
Jason did most of the hard work; I just tinkered with it until it
worked (and pestered a few perl XML guys along the way). Thanks
Grant and Björn!
> This is a good opportunity to encourage people who use Bio::SearchIO
> for BLAST parsing to switch to 'blastxml' format over 'blast'.
> Although the latter is more human readable, it perennially requires
> parser source changes to cope with the variations and new formatting
> introduced with each new NCBI BLAST release. Best to use "-m 7" XML
> format, and convert as appropriate using one of the
> Bio::SearchIO::Writer:: classes.
>
> --Torsten
I'll try getting some benchmarks for the different parsers up today
on the wiki if I have time.
Strangely enough, NCBI changed a few things about BLAST XML a few
releases back without mentioning it to anyone (this caused a silent
bug in BLAST XML parsing, which I fixed recently). In older versions
of BLAST, if you sent in multiple queries you would get all of the
BLAST XML reports concatenated together, which required preparsing
the output to carve up the XML prior to parsing. Now they treat it
like PSI-BLAST, where multiple queries = multiple iterations, so you
get one long BLAST XML report in which each iteration corresponds to
one Result.
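
To make the "carving up" step concrete, here is a minimal sketch in
Python (not BioPerl, and not how Bio::SearchIO does it). It assumes
each concatenated report starts with its own "<?xml" declaration; the
function name and the cut-down sample data are made up for
illustration.

```python
import re
import xml.etree.ElementTree as ET


def split_concatenated_xml(text):
    """Yield each XML document found in a string of back-to-back documents.

    Assumption: every report begins with its own '<?xml ...?>' declaration.
    """
    starts = [m.start() for m in re.finditer(r"<\?xml\s", text)]
    # Pair each start offset with the next one (or the end of the text).
    for begin, end in zip(starts, starts[1:] + [len(text)]):
        yield text[begin:end].strip()


# Hypothetical, heavily simplified stand-in for old-style multi-query output.
sample = (
    '<?xml version="1.0"?>\n'
    "<BlastOutput><BlastOutput_query-def>q1</BlastOutput_query-def></BlastOutput>\n"
    '<?xml version="1.0"?>\n'
    "<BlastOutput><BlastOutput_query-def>q2</BlastOutput_query-def></BlastOutput>\n"
)

reports = list(split_concatenated_xml(sample))
# Each piece is now a well-formed document that can be parsed on its own.
names = [ET.fromstring(r).findtext("BlastOutput_query-def") for r in reports]
```

With the sample above, reports holds two separate documents and names
holds the query definition from each one.
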
The current parser should handle both, as it just caches the remaining
results and returns them one at a time before starting a new parse,
but I wouldn't recommend parsing a huge BLAST XML file with hundreds
of queries as you'll quickly run out of memory!
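
For comparison, this is roughly what handling one iteration at a time
(instead of caching everything) looks like. Again, this is a Python
sketch rather than anything in BioPerl. The Iteration element names
are as I understand the newer NCBI format, and the function name and
sample data are invented for illustration.

```python
import io
import xml.etree.ElementTree as ET


def iter_queries(source):
    """Yield (query_def, hit_count) for each <Iteration>, one at a time."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "Iteration":
            query = elem.findtext("Iteration_query-def")
            hits = len(elem.findall("Iteration_hits/Hit"))
            yield query, hits
            # Drop the finished iteration's children so they can be freed.
            elem.clear()


# Hypothetical, heavily simplified stand-in for a new-style multi-query report.
sample = b"""<?xml version="1.0"?>
<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_query-def>q1</Iteration_query-def>
      <Iteration_hits><Hit/><Hit/></Iteration_hits>
    </Iteration>
    <Iteration>
      <Iteration_query-def>q2</Iteration_query-def>
      <Iteration_hits><Hit/></Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>
"""

results = list(iter_queries(io.BytesIO(sample)))
```

Each iteration's subtree is emptied once it has been reported, so
memory use depends on the largest single iteration rather than on the
whole file.
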
If they get Perl SAX2 up to date with Expat they'll eventually add
parse_chunk() and pause_parse() for each parser. Until then...
chris
Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign