[Bioperl-l] Problems with Bio::SearchIO

Chris Fields cjfields at illinois.edu
Fri Nov 7 21:31:06 UTC 2008


On Nov 7, 2008, at 8:27 AM, Dan Bolser wrote:

> Hi,
>
> I'm new to BioPerl, so apologies if I'm doing something wrong. Please
> just let me know what I'm doing wrong or what you need.
>
>
> I downloaded the latest BioPerl via SVN (revision 14980), and did the,
>
> perl Build.PL
> ./Build test
> ./Build install
>
> (note that I have lib::local installed, pointing things at ~/perl5)
>
> I ignored the failed tests, as-per the "everyone who is l88t does
> this" instructions ;-)

SearchIO (specifically, HSP methods used to calculate seq start/end  
correctly) are being revised in svn, so they are failing at the  
moment.  My local (not-yet-committed) revisions are failing much more,  
but that's b/c the tests are wrong; these should be added in  
relatively soon once I work out more kinks in the code.

> I am trying to parse the output of a "blastall -p blastn" job
> (blastall 2.2.18) using either the default format or tabular format. I
> copied the "search_overview.PLS" script found here:
>
> ...
>
> Looking closer I found that $parser->result_count() only gets set
> after calling $parser->next_result. Any way to force this? In some
> Perl objects I've seen a 'parse' method that kicks the object into
> (silently) calling all its get methods. Is there an equivalent (but
> apparently undocumented) method? Actually, I think it should kick
> itself when called... or not? Certainly the docs do not suggest that
> is won't return a the number of results ("Function: Gets the number of
> Blast results that have been parsed.") So I think this is a bug.

We could make it so that the result_count() is eager (parses the  
results and reports the total back).  Not sure, but we could  
optionally cache the already-parsed Result objects (that could run  
into memory issues if one is parsing a ton of reports, so it needs to  
be off by default).

> Actually I just checked against version 1.4 and "result_count()" fails
> with the default, the XML or the tabular format results (all three
> representing the same query against the same database with the same
> blast version). Note that trying with the XML format in the latest
> version of BioPerl results in the outright failure:
>
> ------------- EXCEPTION: Bio::Root::NotImplemented -------------
> MSG: Abstract method "Bio::SearchIO::result_count" is not implemented
> by package Bio::SearchIO::blastxml.
> This is not your fault - author of Bio::SearchIO::blastxml should be  
> blamed!
>
> ...

That's a bug.  Could you file this so we can track it?

http://bugzilla.open-bio.org/

> While parsing and comparing the different formats (and versions) I
> noticed a couple of other problems that I could not find reported
> anywhere, so I'll report them here.
>
> The appropriate parts of the code are:
>
> while( my $r = $parser->next_result ) {
>  print $r->query_length, "\n";
>  while(my $h = $r->next_hit ) {
>    print $h->length, "\n";
>    my $p  = $h->hsp('best');
>    print $h->significance, "\t", $h->score, "\t", $h->bits, "\n";
>  }
> }
>
> It seems that both the "$r->query_length" and the "$h->length" are
> missing when using the tabular format blast results (using either
> BioPerl version). I get confused here because there is also some
> undocumented behavior used in the above script that means a 'hit'
> (apparently) returns the start and end point of the two most extreme
> HSPs (and the length of that range).

The HSPs are tiled prior to returning values for these methods.  The  
tiling algorithm works for the most part but still has a few odd  
issues (one is reported in bugzilla).  I am not familiar with that bit  
of code so can't comment, but Sendu or Steve might answer.

I try to stay away from using the various hit methods unless necessary  
and usually go with simple HSP coordinates.

> Secondly, I found in all but the default format (using either BioPerl
> version), the "$p->score" is missing (set to an empty string). It
> shows up fine using the default format. The significance and the
> bit-score show up fine... or at least they show up... The values look
> wrong now I come to check. e.g. the score is equal to the bit-score
> when using the default format with version 1.4, and is often smaller
> than the bit-score using the latest version (or is $h->bits not the
> bit score?)

There is some discussion about this in the mail list archives:

http://thread.gmane.org/gmane.comp.lang.perl.bio.general/16273/focus=16296

NCBI BLAST has the hit summary table reporting the raw score(),  
whereas in WU_BLAST the hit table reports bit_score(); in the HSPs I  
think everything is defined.  If you have a minimal hit constructed  
(data only reported in the hit table, no HSP data is reported) some  
may be undefined.  The hit should be calculating the best overall  
score values from the contained HSPs and falling back to directly-set  
hit table data (set as above).  It is possible this may not be  
happening when parsing other formats so it may be a bug (and would be  
worth filing so we can look into it).

> The closest hits in the mailing list that I could find to these  
> probemes were:
>
> http://lists.open-bio.org/pipermail/bioperl-l/2002-May/007936.html
> http://lists.open-bio.org/pipermail/bioperl-l/2002-September/009586.html
>
> but I don't think that they are relevant.
>
> Since it comes up here, how is the 'best' HSP defined? it isn't
> documented as far as I can tell.

'best' - when comparing HSP data to the summary hit table (in text  
output only), the highest scoring HSP represent the hit (highest score/ 
raw_score, lowest evalue).

> About the documentation... looking here:
>
> http://search.cpan.org/~birney/bioperl-1.4/Bio/SearchIO.pm
>
>
> Several of the structured methods 'blocks' are followed by a "See
> Bio::..." link to other pages in CPAN. However the 'next_result'
> method is followed by a link to
> http://search.cpan.org/~birney/bioperl-1.4/Bio/Root/RootI.pm - I think
> it should be a link to
> http://search.cpan.org/~birney/bioperl-1.4/Bio/Search/Result/ 
> ResultI.pm
>
> Also, it would be nice (especially for noobs) if the full list of
> accepted format codes were given on that page. The current text "#
> format can be 'fasta', 'blast', 'exonerate', ..." is extremely
> frustrating for a beginner "... what?!". I now realize that each
> format code is matched by a "Bio::SearchIO::formatcode" module, but I
> didn't know that from reading the above.
>
> While I'm at it, on page
> http://search.cpan.org/~birney/bioperl-1.4/Bio/Search/Hit/HitI.pm -
> the phrase "Equivalent to raw_score()" appearing under the heading
> "score" is a broken link. In fact every "See also : $this->method()"
> type link on that page is broken (there are about 25). Also the link
> to "See also : BUGS" is broken.

The pdoc documentation is better and more up-to-date (unfortunately  
the bioperl-1.4 CPAN docs are out-of-date but always come up first, I  
think b/c of the stable release status).

>> User feedback is an integral part of the evolution of this and  
>> other Bioperl modules. Send your comments and suggestions  
>> preferably to one of the Bioperl mailing lists. Your participation  
>> is much appreciated.
>
> Thank you for your participation! I hope the above can help in some
> way, and I hope its not down to me making trivial mistakes! If these
> look like genuine bugs, should I report them on RT?

No, use the bugzilla set up.  We do not use CPAN's RT and generally  
redirect any bugs to bugzilla.

> Out of interest, I did get some fail while testing, specifically (or
> perhaps coincidentally) some related to SearchIO...
>
> ./Build test verbose=1 > test.results.dump &> test.results.dump
>
> ...
> Dan.

Those are due to the changes I have been making (using svn code is  
bleeding edge!).

> P.S. I've also been attacking the wiki, so please undo any mess that I
> may have made there.
>
>
> -- 
> http://network.nature.com/profile/dan
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Marie-Claude Hofmann
College of Veterinary Medicine
University of Illinois Urbana-Champaign







More information about the Bioperl-l mailing list