[Bioperl-l] Help parsing PSI-BLAST XML reports
Chris Fields
cjfields at uiuc.edu
Thu Apr 5 04:14:46 UTC 2007
On Apr 4, 2007, at 8:34 PM, Torsten Seemann wrote:
> Dear all,
>
> I have been migrating all our BLAST infrastructure to use the XML
> output mode, the "blastpgp -m 7" option, referred to as 'blastxml' format
> in Bioperl. I had never used SearchIO to parse a PSI-BLAST XML report
> before, and encountered some issues I hope you can help me with:
>
> 1. When loading with Bio::SearchIO(-format=>'blastxml') I get back a
> Bio::Search::Result::GenericResult object. This means I can not use
> the PSI-BLAST functions like iterations() and psiblast() provided by
> Bio::Search::Result::BlastResult. I'm guessing this is because the
> XML output reports itself as a plain BLASTP output:
> <BlastOutput_program>blastp</BlastOutput_program>
>
> How do I determine if it is a PSI-BLAST report?
I don't think you can do that easily, though I haven't tried myself.
If I remember correctly, there wasn't a substantial difference in the
XML output between regular BLAST XML and PSI-BLAST XML. We could add
a parameter to the parser to treat the report as PSI-BLAST.
> 2. Usually a PSI-BLAST report has multiple Iterations. The XML output
> has <Iteration> tags but it took me a while to figure out that these
> get mapped to Bio::Search::Result objects accessible via
> Bio::SearchIO->next_result().
>
> Is this the proper way to process the iterations?
The problem is in the way that NCBI now outputs multiple-query BLAST
XML reports, which apparently changed sometime in the last year without
notice. This was also a problem with other Bio* parsers (I remember
seeing something about it on the BioPython list). Previously,
multi-query BLAST requests were output like single XML reports
concatenated together, each with its own XML declaration, etc. Now
they are treated like iterations (query 1 = iteration 1, query 2 =
iteration 2, etc.), all in one long BLAST report. There's an example
of one in the SearchIO tests which I added to CVS in Jan-Feb,
post-1.5.2. The current parser handles both old and new cases.
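Since the report itself does not say whether its iterations are PSI-BLAST rounds or separate queries, one rough workaround is to look at the per-iteration query definitions. The following is only a sketch, not part of BioPerl: it is written in Python with the standard library, the function name classify_iterations is made up for illustration, and it assumes the report is a single XML document whose <Iteration> elements carry an <Iteration_query-def> child. The rule "same query in every iteration means PSI-BLAST" is a heuristic and will misclassify a multi-query run in which every query has the same definition line, or a report that lacks the per-iteration element.

```python
import xml.etree.ElementTree as ET


def classify_iterations(xml_text):
    """Guess what the <Iteration> elements of a BLAST XML report represent.

    Returns "single", "possibly-psiblast" or "multi-query".
    Heuristic only: the XML has no explicit PSI-BLAST marker.
    """
    root = ET.fromstring(xml_text)
    iterations = root.findall("BlastOutput_iterations/Iteration")
    if len(iterations) <= 1:
        return "single"

    # Collect the distinct query definitions seen across iterations.
    # A missing element yields None, which counts as one value.
    query_defs = {it.findtext("Iteration_query-def") for it in iterations}

    # Assumption: PSI-BLAST rounds all share one query, while a
    # multi-query report has a different query per iteration.
    if len(query_defs) == 1:
        return "possibly-psiblast"
    return "multi-query"
```

A caller could use the returned label to decide whether to treat successive results as rounds of one search or as independent results.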
The current behavior of the parser is to parse everything up front,
building up all the ResultI objects and then returning them one by one
on each next_result() call, which is horrible on memory if you have tons
of XML to
wade through. I will probably change that to carve the data up into
report-sized chunks of XML and parse them piecemeal, but I haven't
had time to work on it yet.
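To illustrate the piecemeal idea (this is not the planned BioPerl change, just a sketch of the general technique), here is a streaming reader in Python using the standard library's iterparse. It hands back one iteration at a time and discards each one afterwards instead of holding the whole report in memory. The function name iter_iterations and the choice of summary fields are made up for illustration, and it assumes the newer single-document layout; old-style concatenated reports with several XML declarations would have to be split first.

```python
import xml.etree.ElementTree as ET


def iter_iterations(source):
    """Yield one small summary dict per <Iteration>, streaming the input.

    `source` is a filename or a binary file-like object holding a
    single BLAST XML document.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        # An "end" event fires once the element is fully read.
        if elem.tag != "Iteration":
            continue
        yield {
            "iter_num": elem.findtext("Iteration_iter-num"),
            "query_def": elem.findtext("Iteration_query-def"),
            "hit_count": len(elem.findall("Iteration_hits/Hit")),
        }
        # Drop this iteration's children so the hits can be freed
        # before the next iteration is read.
        elem.clear()
```

Because it is a generator, something like `for it in iter_iterations("report.xml"): ...` only ever holds one iteration's hits at a time (the file name here is a placeholder).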
> 3. I also notice that only the first result (iteration) has the
> query_name set. Subsequent ones are empty:
> RESULT 1 Bio::Search::Result::GenericResult, algorithm= BLASTP,
> query=MyProtein , db=uniprot_sprot
> RESULT 2 Bio::Search::Result::GenericResult, algorithm= BLASTP, query=
> , db=uniprot_sprot
>
> Is this a bug or expected?
If you are using 1.5.2, then there is a bug related to that which was
fixed in CVS a few months back (related to the multi-query issue
above). If it still isn't fixed for you with the CVS code, let me know.
> I'm guessing a lot of these problems are simply due to limitations of
> the NCBI BLAST XML DTD?
>
> --Torsten
To tell the truth I'm not sure. One would think they could add some
designation to the report for PSI-BLAST!
chris