[Bioperl-l] SearchIO speed up

Thu Aug 10 17:39:59 UTC 2006

> ...Except I need to know if the community considers the speed problem 
> solved or not. More radical changes will make SearchIO even faster, eg. 
> Chris Fields and Jason (if I interpret the Project priority list item 
> correctly) have suggested an end to individual Hit and HSP objects, 
> which become just data members of a Result-like object. Ideally I don't 
> want to go down that route because we lose quite a bit of OO power;

As already mentioned, a lazy-evaluation approach would also work.

Jason and I did once talk about an entirely new parsing/object-building 
framework, based on nested grammars; in essence, the "top-level" parser 
simply "chunks" the input into blobs of (minimally parsed) text that 
correspond to the top-level result object.  This chunk/blob is the input 
to the next-level parser for Hits, which in turn produces chunks for HSPs. 
Note that the Result/Hit/HSP "chunks" are "fat", i.e. they *are* the same 
Generic*I-implementing objects we're already using.  Thus, if HSPs are 
never interrogated, they're never parsed; as soon as one is interrogated, 
it gets parsed, and so on.  In such an environment, you can imagine 
flyweight objects that are built very quickly/easily (recall that many 
previous analyses of BioPerl speed problems traced them not so much to 
parsing as to heavy-weight object creation).
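
To make the nested, parse-on-demand idea concrete, here is a rough sketch. 
It is written in Python purely for brevity and is not BioPerl code: the 
class names, the PARSE_LOG list, and the toy report format (hits start 
with ">", each following line is one HSP of key=value pairs) are all 
made up for illustration. Each level only holds its raw chunk of text 
until somebody asks it a question.

```python
from functools import cached_property

PARSE_LOG = []  # hypothetical: records which levels were actually parsed


class HSP:
    def __init__(self, chunk):
        self._chunk = chunk  # raw text, untouched until a field is requested

    @cached_property
    def _fields(self):
        PARSE_LOG.append("hsp")
        return dict(pair.split("=", 1) for pair in self._chunk.split())

    @property
    def score(self):
        return float(self._fields["score"])

    @property
    def evalue(self):
        return float(self._fields["evalue"])


class Hit:
    def __init__(self, chunk):
        self._chunk = chunk

    @cached_property
    def _lines(self):
        PARSE_LOG.append("hit")
        return [ln for ln in self._chunk.splitlines() if ln.strip()]

    @property
    def name(self):
        return self._lines[0].lstrip(">").strip()

    @cached_property
    def hsps(self):
        # Each HSP gets only its own line; it is not parsed here.
        return [HSP(ln) for ln in self._lines[1:]]


class Result:
    def __init__(self, chunk):
        self._chunk = chunk

    @cached_property
    def hits(self):
        # Top-level parser: split the report into per-hit chunks, nothing more.
        PARSE_LOG.append("result")
        chunks, current = [], None
        for line in self._chunk.splitlines():
            if line.startswith(">"):
                current = [line]
                chunks.append(current)
            elif current is not None:
                current.append(line)
        return [Hit("\n".join(c)) for c in chunks]


REPORT = """Query: q1
>hitA
score=50 evalue=1e-10
score=30 evalue=1e-3
>hitB
score=12 evalue=0.5
"""

result = Result(REPORT)                  # nothing is parsed at this point
names = [h.name for h in result.hits]    # parses the result and each hit
best = result.hits[0].hsps[0].score      # parses exactly one HSP
```

If a script only ever looks at hit names, no HSP line is ever split into 
fields; the cost of building the objects stays close to the cost of 
storing a string.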

I happen to have such a nested parser lying around for 
Bio::SearchIO::fasta.pm, but it also uses an Inline::C, yacc-generated C 
parser backend (yet another experiment in trying to get SearchIO to run 
faster), so it really isn't ready for prime time (being entirely untested, 
and probably not even finished).

-Aaron