[Bioperl-l] SearchIO speed up

aaron.j.mackey at gsk.com aaron.j.mackey at gsk.com
Thu Aug 10 20:43:52 UTC 2006


> As I understand your description, this is exactly what I do. My 'chunks' 
> are the hashes that are normally used to create a new Hit/HSP object.
> 
> The initial parse of the data file results in a small number of objects 
> (Results) that contain all the data: HSP data nested in Hit data nested 
> in the Result objects. When you actually want to do something with a 
> certain hit or HSP it becomes an object, allowing you to call its 
> methods like normal.
> 
> Or are you suggesting something that would be even better than that? If 
> so, please elucidate! :)

So the only laziness you invoke is in the object instantiation (but you've 
already done all the parsing).

My proposal involves the "chunks" being unparsed, raw text "blobs", that 
are essentially blessed into a package that does the parsing only when 
necessary (and even then, might choose different parsing strategies, based 
on what's been asked for).  Thus a potentially large amount of parsing and 
storage is skipped.  Additionally, you now have the option of not even 
storing the blobs in memory, just file seek pointers (requiring temp. 
storage for streaming pipe data sources), and thus can process very large 
reports without consuming memory (currently a problem).
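
A minimal sketch of the "blessed blob" idea, under stated assumptions: the 
package name LazyResult, its cache layout, and the regex for the query line 
are all hypothetical illustrations, not existing Bio::SearchIO code. The 
object stores only raw text and extracts a field the first time it is asked 
for:

```perl
use strict;
use warnings;

# Hypothetical package; not part of Bio::SearchIO.
package LazyResult;

sub new {
    my ($class, $raw_text) = @_;
    # Keep only the unparsed text; nothing is extracted at construction.
    return bless { raw => $raw_text, cache => {} }, $class;
}

sub query {
    my ($self) = @_;
    # Parse the query name on first request, then reuse the cached value.
    unless (exists $self->{cache}{query}) {
        my ($name) = $self->{raw} =~ /^Query=\s*(\S+)/m;
        $self->{cache}{query} = $name;
    }
    return $self->{cache}{query};
}

package main;

# Toy stand-in for one result's raw text (assumed layout).
my $blob   = "BLASTP 2.2.14\n\nQuery= seq1 test protein\n         (120 letters)\n";
my $result = LazyResult->new($blob);
print $result->query(), "\n";
```

Other accessors could follow the same pattern, each running only the 
parsing it needs.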

Just to reiterate, here's some "user level" code with comments describing 
what's happened behind the scenes:

use Bio::SearchIO;

my $io = Bio::SearchIO->new(-format => "blast",
                            -file   => "myresult.blast");

# when next_result is called, $io has to do the top-level parse
# to figure out the start/stop of the next result
while (my $result = $io->next_result()) {
  # $result is now a "blessed" blob

  my $query = $result->query(); # blob got (minimally) lazily parsed
                                # to extract the requested bit, nothing more

  # first time next_hit is called, $result has to do the next-level parse
  # to figure out the start(s)/stop(s) of each hit; for BLAST reports,
  # this info is in two places, the hit table and the alignment info
  while (my $hit = $result->next_hit()) {
    # etc.
  }
}

Note also that the current "push" event model can still work with this 
architecture, but a "pull" model would speed up initial access even more 
(preventing the need to parse/store the entire enumeration of blobs to get 
the first/next), and lower the memory footprint even further.
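
To make the "pull" variant with file seek pointers concrete, here is a rough 
sketch. Everything in it is an assumption for illustration: the PullReader 
package, its next_span/text methods, and the rule that a record starts at a 
line beginning with "Query=". Each call scans only as far as the start of the 
following record and hands back byte offsets rather than text:

```perl
use strict;
use warnings;

# Hypothetical package; not part of Bio::SearchIO.
package PullReader;

sub new {
    my ($class, $fh) = @_;
    return bless { fh => $fh, pos => tell($fh) }, $class;
}

# Return [start, end] byte offsets of the next record, or nothing at EOF.
sub next_span {
    my ($self) = @_;
    my $fh = $self->{fh};
    seek($fh, $self->{pos}, 0);
    my ($start, $end);
    while (1) {
        my $line_start = tell($fh);
        my $line = <$fh>;
        last unless defined $line;
        if ($line =~ /^Query=/) {
            # A second record header marks the end of the current record.
            if (defined $start) { $end = $line_start; last; }
            $start = $line_start;
        }
    }
    return unless defined $start;
    $end = tell($fh) unless defined $end;   # last record runs to EOF
    $self->{pos} = $end;
    return [$start, $end];
}

# Fetch the raw text for a span only when somebody actually needs it.
sub text {
    my ($self, $span) = @_;
    my ($start, $end) = @$span;
    seek($self->{fh}, $start, 0);
    read($self->{fh}, my $buf, $end - $start);
    return $buf;
}

package main;

# In-memory file standing in for a report on disk (toy data).
my $report = "Query= a\nhit 1\nQuery= b\nhit 2\nhit 3\n";
open(my $fh, '<', \$report) or die "cannot open in-memory file: $!";

my $reader = PullReader->new($fh);
my @spans;
while (my $span = $reader->next_span()) {
    push @spans, $span;
}
print $reader->text($spans[0]);
```

Only the offsets stay in memory; a piped (non-seekable) source would first 
need spooling to temporary storage, as noted above.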

-Aaron



