[Bioperl-l] SearchIO speed up
aaron.j.mackey at gsk.com
Thu Aug 10 20:43:52 UTC 2006
> As I understand your description, this is exactly what I do. My 'chunks'
> are the hashes that are normally used to create a new Hit/HSP object.
>
> The initial parse of the data file results in a small number of objects
> (Results) that contain all the data: HSP data nested in Hit data nested
> in the Result objects. When you actually want to do something with a
> certain hit or HSP it becomes an object, allowing you to call its
> methods like normal.
>
> Or are you suggesting something that would be even better than that? If
> so, please elucidate! :)
So the only laziness you invoke is in the object instantiation (you've
already done all the parsing by that point).
My proposal is for the "chunks" to be unparsed, raw text "blobs" that are
blessed into a package which does the parsing only when necessary (and
even then, it might choose different parsing strategies based on what's
been asked for). Thus a potentially large amount of parsing and storage
is skipped. Additionally, you now have the option of not even storing the
blobs in memory, just file seek pointers (which requires temporary
storage when the data source is a streaming pipe), and thus can process
very large reports without consuming much memory (currently a problem).
Just to reiterate, here's some "user level" code with comments describing
what's happened behind the scenes:
    use Bio::SearchIO;
    my $io = Bio::SearchIO->new(-format => "blast",
                                -file   => "myresult.blast");
    # when next_result is called, $io has to do the top-level parse
    # to figure out the start/stop of the next result
    while (my $result = $io->next_result()) {
        # $result is now a "blessed" blob
        my $query = $result->query();  # blob got (minimally) lazily parsed
                                       # to extract the requested bit,
                                       # nothing more
        # first time next_hit is called, $result has to do the next-level
        # parse to figure out the start(s)/stop(s) of each hit; for BLAST
        # reports, this info is in two places, the hit table and the
        # alignment info
        while (my $hit = $result->next_hit()) {
            # etc.
        }
    }
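To make the "blessed blob" idea concrete, here is a minimal sketch of what
such an object might look like behind the scenes. It is only an
illustration under stated assumptions: LazyBlob is a hypothetical package
name (not a BioPerl class), the "Query=" regex is a simplification, and
the report text is made-up toy data read through an in-memory filehandle.
The object stores only a filehandle plus a seek offset and length, and
parses the query name the first time it is requested.

```perl
use strict;
use warnings;

package LazyBlob;    # hypothetical class, not part of BioPerl

# Remember where the chunk lives; read and parse nothing yet.
sub new {
    my ($class, $fh, $start, $length) = @_;
    return bless { fh => $fh, start => $start, length => $length }, $class;
}

# Fetch the raw text from the file on demand, using the stored offset.
sub raw {
    my ($self) = @_;
    seek($self->{fh}, $self->{start}, 0) or die "seek failed: $!";
    my $n = read($self->{fh}, my $buf, $self->{length});
    die "read failed: $!" unless defined $n;
    return $buf;
}

# Parse only the query name, and only on the first request; later
# calls return the cached value.
sub query {
    my ($self) = @_;
    unless (exists $self->{query}) {
        ($self->{query}) = $self->raw() =~ /^Query=\s*(\S+)/m;
    }
    return $self->{query};
}

package main;

# Toy report text standing in for a real BLAST file (assumption).
my $report = "BLASTP 2.2.14\n\nQuery= seq1 test protein\n";
open(my $fh, '<', \$report) or die "cannot open in-memory file: $!";

my $blob = LazyBlob->new($fh, 0, length $report);
print $blob->query(), "\n";
```

A real implementation would add next_hit() along the same lines, and
could pick a cheaper or more thorough parse depending on which accessor
was called.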
Note also that the current "push" event model can still work with this
architecture, but a "pull" model would speed up initial access even more
(avoiding the need to parse/store the entire enumeration of blobs just to
get the first/next one), and would lower the memory footprint even further.
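The "pull" idea can be sketched as an iterator that scans only as far as
the end of the next result each time it is asked. Again this is a rough
sketch, not existing BioPerl code: make_result_iterator is a hypothetical
helper, the rule "a line starting with a BLAST program name begins a new
result" is a simplifying assumption, and the report text is toy data.

```perl
use strict;
use warnings;

# Return a closure that yields [start_offset, length] for one result
# per call, reading no further than the start of the following result.
sub make_result_iterator {
    my ($fh) = @_;
    my $start;    # offset of the result currently being scanned
    return sub {
        while (1) {
            my $pos  = tell($fh);
            my $line = <$fh>;
            if (!defined $line) {             # end of input
                return unless defined $start;
                my $chunk = [$start, $pos - $start];
                undef $start;
                return $chunk;
            }
            next unless $line =~ /^T?BLAST[NPX]/;
            if (defined $start) {             # new header closes previous result
                my $chunk = [$start, $pos - $start];
                $start = $pos;
                return $chunk;
            }
            $start = $pos;                    # first header seen
        }
    };
}

# Two toy results concatenated in one in-memory "file" (assumption).
my $report = "BLASTP 2.2.14\nQuery= a\n" . "BLASTP 2.2.14\nQuery= b\n";
open(my $fh, '<', \$report) or die "cannot open in-memory file: $!";

my $next_result = make_result_iterator($fh);
my @chunks;
while (my $chunk = $next_result->()) {
    push @chunks, $chunk;
    printf "result at offset %d, %d bytes\n", @$chunk;
}
```

Each returned pair is exactly the seek pointer a lazy blob would need, so
the first result is available after scanning only up to the header of the
second one, rather than after indexing the whole report.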
-Aaron