[Bioperl-l] Parsing BLAST 2.2.14 output

Thu Jun 15 23:03:37 UTC 2006

...

> Hi Chris and Sendu,
> 
> Thanks for your replies.  I am using blastp from the NCBI BLAST page,
> with this input sequence:

...

> I have tried saving HTML (with and without the graphical overview),
> plain text, and XML.  I am parsing with this script:

> #!/usr/local/bin/perl -w
> 
> use Bio::SearchIO;
> ...
> }

I got this script to work.  I used your sequence, retrieved BLASTP text
output from NCBI BLASTP 2.2.14, saved it from the web browser, and copied
it to three separate files.  All three files parse fine as input, with
output like this:

DB All non-redundant GenBank CDStranslations+PDB+SwissProt+PIR+PRF excluding
environmental samples
 ALG BLASTP
QRY
        gi|27502689|gb|AAH42571.1|
        HSPS: 1
        gi|21779923|gb|AAM77583.1|
        HSPS: 1
...

> Interestingly, the results are different (but never correct) for the
> different types of output I've tried.  For xml, the script runs but
> produces no output, for plain text the script hangs with no output, and
> for html, I get these errors:

What's interesting is that HTML did anything at all.  You MUST strip out the
HTML tags as per the FAQ, which I pointed out before:

http://www.bioperl.org/wiki/FAQ

See the question : Does Bio::SearchIO parse the HTML output that BLAST
creates using the -T option?

Again, I would NOT attempt parsing HTML.  The only reason we have a FAQ
question about it is that it came up on the list many times in the past
(i.e. it is a FAQ) and someone found that HTML::Strip works.  We will
never adequately support it beyond suggesting that you strip the tags out.
NCBI changes their HTML output more often than their text output.

If you tried parsing XML with the format set to 'blast', you'll get nothing
(the blast text parser looks for text output using regexes, so it just
skips past all the XML tags).  You must set:

-format => 'blastxml' 

You'll also need to install XML::SAX, and I would suggest installing
XML::SAX::ExpatXS and the Expat XML parser for your system to speed things
up.
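
As a rough sketch of the point above, the format can be chosen from the
file contents so that an XML report is never handed to the text parser.
The helper names guess_searchio_format and print_report are hypothetical
(they are not part of BioPerl), the printed fields only approximate the
output shown earlier, and Bio::SearchIO is loaded lazily so the format
check can be used on its own:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: inspect the first non-blank line of a saved report
# and return the matching Bio::SearchIO format name.
sub guess_searchio_format {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    while (my $line = <$fh>) {
        next unless $line =~ /\S/;                     # skip blank lines
        close $fh;
        return 'blastxml' if $line =~ /^\s*<\?xml/;    # XML report
        die "$path looks like HTML; strip the tags first\n"
            if $line =~ /^\s*<(?:!DOCTYPE\s+html|html)/i;
        return 'blast';                # assumption: anything else is text
    }
    close $fh;
    die "$path is empty\n";
}

# Hypothetical driver: print database, algorithm, query, and hits.
sub print_report {
    my ($path) = @_;
    require Bio::SearchIO;             # loaded only when actually parsing
    my $in = Bio::SearchIO->new(
        -format => guess_searchio_format($path),
        -file   => $path,
    );
    while (my $result = $in->next_result) {
        print "DB ",  $result->database_name, "\n";
        print "ALG ", $result->algorithm,     "\n";
        print "QRY ", $result->query_name,    "\n";
        while (my $hit = $result->next_hit) {
            print "\t", $hit->name, "\n";
            print "\tHSPS: ", $hit->num_hsps, "\n";
        }
    }
}

print_report($ARGV[0]) if @ARGV;
```

Run it with the saved report as the only argument.  The 'blastxml' branch
still needs XML::SAX installed, as noted above.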

The 'hanging' you mention with text parsing sounds like the old bug where
the parser got caught in an infinite loop.  I don't have this problem.  It
could be a couple of things:

1) You have an old version of bioperl and updated Bio::SearchIO, but you
haven't updated Bio::SearchIO::blast.  That's the plugin module where the
bug was (not Bio::SearchIO).  Try either updating that module or
installing the entire distribution from scratch.

2) You have two versions of Bioperl installed (an old one and bioperl-live)
and perl is using the old version of bioperl (and the old version of
SearchIO::blast).  Make sure you only have one version installed and that it
is bioperl-live.
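
One way to check for both cases is to list every copy of the plugin module
that perl can see, in search order.  This is only a sketch using core
modules; find_module_copies is a hypothetical helper name, and it assumes
the module is installed as an ordinary .pm file under a directory in @INC:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Spec;

# Return every copy of a module file found along a list of directories
# (defaults to @INC), in the order perl would search them.
sub find_module_copies {
    my ($module, @dirs) = @_;
    @dirs = @INC unless @dirs;
    my @parts = split /::/, $module;   # Bio::SearchIO::blast -> Bio, SearchIO, blast
    $parts[-1] .= '.pm';
    my (@found, %seen);
    for my $dir (@dirs) {
        next if ref $dir;              # skip code-ref hooks in @INC
        my $path = File::Spec->catfile($dir, @parts);
        push @found, $path if -f $path && !$seen{$path}++;
    }
    return @found;
}

my @copies = find_module_copies('Bio::SearchIO::blast');
print "perl will load: $copies[0]\n" if @copies;
print "shadowed copy:  $_\n" for @copies[1 .. $#copies];
print "Bio::SearchIO::blast not found in \@INC\n" unless @copies;
```

If more than one path is printed, the first is the one perl uses, so an old
copy listed first would explain the hang even with bioperl-live installed.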

> At this point I should probably try installing all of bioperl-live, or
> at least get IteratedSearchResultEventBuilder.pm - or would you
> recommend something else?  Let me know if you need more info.

If you have the entire distribution installed, you should have ISREB anyway.
ISREB (IteratedSearchResultEventBuilder) has nothing to do with the problems
here, though.

Chris

> Thanks again,
> -susan