[Bioperl-l] SearchIO speed up

Chris Fields cjfields at uiuc.edu
Thu Aug 10 22:51:33 UTC 2006


...
> So the only laziness you invoke is the object instantiation (but you've
> already done all the parsing).
> 
> My proposal involves the "chunks" being unparsed, raw text "blobs", that
> are essentially blessed into a package that does the parsing only when
> necessary (and even then, might choose different parsing strategies, based
> on what's been asked for).  Thus a potentially large amount of parsing and
> storage is skipped.  Additionally, you now have the option of not even
> storing the blobs in memory, just file seek pointers (requiring temp.
> storage for streaming pipe data sources), and thus can process very large
> reports without consuming memory (currently a problem).
> 
> Just to reiterate, here's some "user level" code with comments describing
> what's happened behind the scenes:
> 
> use Bio::SearchIO;
> 
> my $io = Bio::SearchIO->new(-format => "blast",
>                             -file   => "myresult.blast");
> 
> # when next_result is called, $io has to do the top-level parse
> # to figure out the start/stop of the next result
> while (my $result = $io->next_result()) {
>   # $result is now a "blessed" blob
> 
>   my $query = $result->query(); # blob got (minimally) lazily parsed
>                                 # to extract the requested bit, nothing more
> 
>   # first time next_hit is called, $result has to do the next-level parse
>   # to figure out the start(s)/stop(s) of each hit; for BLAST reports,
>   # this info is in two places, the hit table and the alignment info
>   while (my $hit = $result->next_hit()) {
>     # etc.
>   }
> }
> 
> Note also that the current "push" event model can still work with this
> architecture, but a "pull" model would speed up initial access even more
> (preventing the need to parse/store the entire enumeration of blobs to get
> the first/next), and lower the memory footprint even further.
> 
> -Aaron

Using file pointers is a great touch.  Sendu has a slight aversion to temp
files but he has already indicated other ways around this.
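
For anyone following along, here is a rough sketch of the file-pointer idea as I read it. It is not BioPerl code: the LazyBlob package, its methods, and the toy "Query=" report are all made up for illustration, and an in-memory filehandle stands in for a real report file. Each object keeps only a handle plus start/end byte offsets, and the text is read and parsed only when query() is first called.

```perl
use strict;
use warnings;

# Hypothetical class for illustration only; not part of BioPerl.
package LazyBlob;

sub new {
    my ($class, %args) = @_;
    # Store only the filehandle and byte offsets, not the text itself.
    my $self = { fh => $args{fh}, start => $args{start}, end => $args{end} };
    return bless $self, $class;
}

# Read the raw text for this blob from the handle on demand.
sub _raw {
    my ($self) = @_;
    my $len = $self->{end} - $self->{start};
    seek($self->{fh}, $self->{start}, 0) or die "seek failed: $!";
    my $buf = '';
    read($self->{fh}, $buf, $len);
    return $buf;
}

# Parse the query name only when asked, then cache it.
sub query {
    my ($self) = @_;
    unless (exists $self->{query}) {
        ($self->{query}) = $self->_raw =~ /^Query=\s*(\S+)/m;
    }
    return $self->{query};
}

package main;

# Toy two-result report; the real BLAST format is far richer.
my $report = "Query= seqA\nhit1\nhit2\nQuery= seqB\nhit3\n";
open(my $fh, '<', \$report) or die "open failed: $!";

# Top-level scan: record where each result starts, parse nothing else.
my @starts;
my $pos = tell($fh);
while (my $line = <$fh>) {
    push @starts, $pos if $line =~ /^Query=/;
    $pos = tell($fh);
}
my $eof = $pos;

# Build one lazy object per result from consecutive start offsets.
my @results;
for my $i (0 .. $#starts) {
    my $end = $i < $#starts ? $starts[$i + 1] : $eof;
    push @results, LazyBlob->new(fh => $fh, start => $starts[$i], end => $end);
}

print $_->query, "\n" for @results;
```

This sketch still enumerates every result offset up front; a pull-style version would find the next boundary only when the next result is requested.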

Would be nice to see this come to fruition.  Okay, really have to get back to
work!

Chris
