[Bioperl-l] Problems with Bio::SearchIO

Tue Nov 11 00:32:23 UTC 2008

On Nov 10, 2008, at 4:29 PM, Dan Bolser wrote:

> 2008/11/7 Chris Fields <cjfields at illinois.edu>:
>> On Nov 7, 2008, at 8:27 AM, Dan Bolser wrote:
>>> ...
>>> Looking closer I found that $parser->result_count() only gets set
>>> after calling $parser->next_result. Any way to force this? In some
>>> Perl objects I've seen a 'parse' method that kicks the object into
>>> (silently) calling all its get methods. Is there an equivalent (but
>>> apparently undocumented) method? Actually, I think it should kick
>>> itself when called... or not? Certainly the docs do not suggest that
>>> it won't return the number of results ("Function: Gets the number of
>>> Blast results that have been parsed."). So I think this is a bug.
>>
>> We could make it so that result_count() is eager (parses the results
>> and reports the total back). Not sure, but we could optionally cache
>> the already-parsed Result objects (that could run into memory issues
>> if one is parsing a ton of reports, so it needs to be off by default).
>
> I see (I think). Anyone first calling result_count() and *then*
> iterating over the results is getting a performance hit by effectively
> parsing the results twice? I would suggest that you make this function
> eager, but document the potential performance issue so that people can
> choose not to call it first. However, I don't think I can have
> understood correctly. How can its value be set correctly after calling
> next() only once?

It's quite possible that result_count() is meant to report the number of
ResultI objects parsed up to the point where it is called (as opposed to
the total number of ResultI objects in the report), but that isn't made
clear. Judging by the naming of the other Bio::Search methods that
return totals (num_hits, num_hsps), I think that's the case.

However, if it's meant to be the total number of ResultI objects, then
result_count() should be eager. It must essentially run out the iterator
and return the total number of results whether we implement caching or
not; otherwise it isn't returning the correct value. BTW, resetting the
iterator afterwards relies on the input being seekable (which it may
well not be, e.g. a pipe), so caching ResultI objects should probably be
made optionally available.
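
To make the two readings concrete, here is a toy sketch in Python. It is
not BioPerl code: ToyResultParser and total_results() are made-up names,
each input line stands in for one parsed result, and only next_result and
result_count mirror the method names discussed above. It assumes the
"lazy" reading for result_count() and shows what an eager total would
have to do, including the dependence on a seekable input.

```python
import io


class ToyResultParser:
    """Toy stand-in for a streaming report parser (hypothetical, not BioPerl)."""

    def __init__(self, handle):
        self._handle = handle
        self._count = 0  # results parsed so far

    def next_result(self):
        """Return the next record, or None when the input is exhausted."""
        line = self._handle.readline()
        if not line:
            return None
        self._count += 1
        return line.strip()

    def result_count(self):
        """Lazy reading: number of results parsed up to this call."""
        return self._count

    def total_results(self):
        """Eager reading: run out the iterator, then rewind if the input allows."""
        start_count = self._count
        start_pos = self._handle.tell() if self._handle.seekable() else None
        while self.next_result() is not None:
            pass
        total = self._count
        if start_pos is not None:
            # Only possible on seekable input; a pipe would stay consumed.
            self._handle.seek(start_pos)
            self._count = start_count
        return total


parser = ToyResultParser(io.StringIO("r1\nr2\nr3\n"))
parser.next_result()
print(parser.result_count())   # results parsed so far
print(parser.total_results())  # every result in the input
```

On a non-seekable handle the eager call would leave nothing to iterate
over afterwards, which is where an optional cache would come in.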

>>> ...
>>> The closest hits in the mailing list that I could find to these
>>> problems were:
>>>
>>> http://lists.open-bio.org/pipermail/bioperl-l/2002-May/007936.html
>>> http://lists.open-bio.org/pipermail/bioperl-l/2002-September/009586.html
>>>
>>> but I don't think that they are relevant.
>>>
>>> Since it comes up here, how is the 'best' HSP defined? It isn't
>>> documented as far as I can tell.
>>
>> 'best' - when comparing HSP data to the summary hit table (in text
>> output only), the highest-scoring HSP represents the hit (highest
>> score/raw_score, lowest evalue).
>
> Which?

Right now I think it's going by evalue/pvalue, but this is dependent  
on the BLAST report.
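
As a rough illustration of why "which?" matters, here is a toy Python
sketch (not BioPerl code; the Hsp tuple, the function names and the
numbers are made up). It assumes several HSPs can share the same evalue,
for example 0.0, in which case ranking by evalue alone just keeps
whichever HSP comes first, while adding score as a tie-breaker picks the
highest-scoring one.

```python
from collections import namedtuple

# Hypothetical HSP record: only the fields needed for ranking.
Hsp = namedtuple("Hsp", ["name", "score", "evalue"])


def best_by_evalue_only(hsps):
    """Lowest evalue wins; ties keep the first HSP encountered."""
    return min(hsps, key=lambda h: h.evalue)


def best_by_evalue_then_score(hsps):
    """Lowest evalue wins; ties are broken by the highest score."""
    return min(hsps, key=lambda h: (h.evalue, -h.score))


def best_by_score(hsps):
    """Highest score wins; ties are broken by the lowest evalue."""
    return max(hsps, key=lambda h: (h.score, -h.evalue))


hsps = [
    Hsp("a", 500, 0.0),
    Hsp("b", 900, 0.0),   # same evalue as "a", higher score
    Hsp("c", 40, 1e-3),
]

print(best_by_evalue_only(hsps).name)
print(best_by_evalue_then_score(hsps).name)
print(best_by_score(hsps).name)
```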

>>> About the documentation... looking here:
>>>
>>> http://search.cpan.org/~birney/bioperl-1.4/Bio/SearchIO.pm
>>>
>>>
>>> Several of the structured methods 'blocks' are followed by a "See
>>> Bio::..." link to other pages in CPAN. However the 'next_result'
>>> method is followed by a link to
>>> http://search.cpan.org/~birney/bioperl-1.4/Bio/Root/RootI.pm - I think
>>> it should be a link to
>>> http://search.cpan.org/~birney/bioperl-1.4/Bio/Search/Result/ResultI.pm
>>>
>>> Also, it would be nice (especially for noobs) if the full list of
>>> accepted format codes were given on that page. The current text "#
>>> format can be 'fasta', 'blast', 'exonerate', ..." is extremely
>>> frustrating for a beginner "... what?!". I now realize that each
>>> format code is matched by a "Bio::SearchIO::formatcode" module, but I
>>> didn't know that from reading the above.
>>>
>>> While I'm at it, on page
>>> http://search.cpan.org/~birney/bioperl-1.4/Bio/Search/Hit/HitI.pm -
>>> the phrase "Equivalent to raw_score()" appearing under the heading
>>> "score" is a broken link. In fact every "See also : $this->method()"
>>> type link on that page is broken (there are about 25). Also the link
>>> to "See also : BUGS" is broken.
>>
>> The pdoc documentation is better and more up-to-date (unfortunately the
>> bioperl-1.4 CPAN docs are out-of-date but always come up first, I think
>> b/c of the stable release status).

1.5.2 is also on CPAN and is more up-to-date, but it is labeled a dev
release, so it doesn't pop up immediately.

>>>> User feedback is an integral part of the evolution of this and other
>>>> Bioperl modules. Send your comments and suggestions preferably to one
>>>> of the Bioperl mailing lists. Your participation is much appreciated.
>>>
>>> Thank you for your participation! I hope the above can help in some
>>> way, and I hope it's not down to me making trivial mistakes! If these
>>> look like genuine bugs, should I report them on RT?
>>
>> No, use the bugzilla setup. We do not use CPAN's RT and generally
>> redirect any bugs to bugzilla.
>>
>>> Out of interest, I did get some failures while testing, specifically
>>> (or perhaps coincidentally) some related to SearchIO...
>>>
>>> ./Build test verbose=1 > test.results.dump &> test.results.dump
>>>
>>> ...
>>> Dan.
>>
>> Those are due to the changes I have been making (using svn code is
>> bleeding edge!).
>>
>>> P.S. I've also been attacking the wiki, so please undo any mess that I
>>> may have made there.
>
>
> Thanks very much for the detailed reply. Overall, would you recommend
> that I use SVN or 1.4 or 1.5.2?
>
> All the best,
>
> Dan.

np. Feel free to update the wiki (the clearer the docs are, the better).

-c