[Bioperl-l] SearchIO speed up
aaron.j.mackey at gsk.com
Mon Aug 14 17:01:47 UTC 2006
> And then of course the idea is that this is nested, so the parser for
> the result data is a Bio::Search::Result::ResultI but also a pull-parser
> in its own right (and so on for HitI and HSPI) with a need for
> random-access to the various bits of data needed to answer all the
> various methods of ResultI.
the second- (and third- and so on) level parsers can work on in-memory
"blobs" (if seeking is unavailable), as these will be minute in
comparison; it's only the top-level SearchIO parser that needs to fuss about
streaming pipes and seekability.
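
To illustrate the layering described above, here is a minimal sketch in Python (not Perl, and not the BioPerl API). The "#RESULT" delimiter, the "hit" line format, and the class and function names are all made up for the example. Only the top-level generator touches the stream, and it only ever reads forward; the nested objects parse small in-memory strings.

```python
import io
import re

RECORD_START = "#RESULT"  # hypothetical delimiter, not a real report format


class Hit:
    """Second-level parser: works only on an in-memory string."""

    def __init__(self, blob):
        self._blob = blob

    @property
    def name(self):
        # Parsing happens lazily, on the small blob this object owns.
        match = re.search(r"^hit\s+(\S+)", self._blob, re.MULTILINE)
        return match.group(1) if match else None


class Result:
    """First-level object that is also a parser over its own blob."""

    def __init__(self, blob):
        self._blob = blob

    def hits(self):
        # Each hit line becomes its own tiny blob; no seeking involved.
        for line in self._blob.splitlines():
            if line.startswith("hit "):
                yield Hit(line)


def results(stream):
    """Top-level parser: reads forward only, so it also works on a pipe."""
    lines = []
    for line in stream:
        if line.startswith(RECORD_START) and lines:
            yield Result("".join(lines))
            lines = []
        lines.append(line)
    if lines:
        yield Result("".join(lines))


if __name__ == "__main__":
    demo = io.StringIO("#RESULT q1\nhit A\nhit B\n#RESULT q2\nhit C\n")
    for result in results(demo):
        print([hit.name for hit in result.hits()])
```

The point of the design is that seekability is a concern of `results()` alone; `Result` and `Hit` never see a file handle.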
> I currently have a -piped_behaviour argument that accepts 'memory' or
> 'temp_file'.
does it default to memory?
> How about a third (non-default) option of 'linear' to avoid
> any attempt at a seek and just use the data as it is piped?
fine; we can quibble about stylistic API issues later.
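
For concreteness, the three modes under discussion could be sketched roughly as follows, in Python rather than Perl. The function name `prepare_input` is hypothetical, and the mode is a required argument here because the thread does not settle which of 'memory' or 'temp_file' is the default.

```python
import io
import tempfile


def prepare_input(stream, piped_behaviour):
    """Return a file-like object for the parser, according to the mode."""
    if piped_behaviour == "memory":
        # Slurp the whole pipe into RAM; the result supports seek().
        return io.BytesIO(stream.read())
    if piped_behaviour == "temp_file":
        # Spool the pipe to disk in chunks; seekable, small memory footprint.
        tmp = tempfile.TemporaryFile()
        for chunk in iter(lambda: stream.read(65536), b""):
            tmp.write(chunk)
        tmp.seek(0)
        return tmp
    if piped_behaviour == "linear":
        # Hand the pipe through untouched; the parser must never seek.
        return stream
    raise ValueError(f"unknown piped_behaviour: {piped_behaviour!r}")
```

In the first two modes the parser can assume random access; in 'linear' mode it gets whatever the pipe offers and has to consume the data in order.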
> The trouble
> is that you'd need to virtually implement the methods of a parser module
> twice, once where the methods can seek, second where they can't. Or
> maybe not; I'll have to try and see if some sane compromise
> implementation is possible.
fundamentally, parsing occurs when regular expressions operate on
in-memory blobs; so while you can keep lots of file pointers around to
define many largish blobs with minimal memory footprint, at some point
they need to become memory-resident for the parser to take effect.
Conversely, if you spend too much time finding out the fine-grained
locations of every parsable bit, and saving the pointers, then you're
recapitulating Perl's own variable storage mechanisms.
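
A rough sketch of the middle ground, again in Python and with a made-up "#RESULT" marker and "hit" line format: one forward pass stores a single (offset, length) pair per large record, and a record only becomes memory-resident when something actually asks for it, at which point ordinary regular expressions run over the blob. The function names are hypothetical.

```python
import io
import re


def index_records(handle, marker=b"#RESULT"):
    """One forward pass: store (offset, length) for each record."""
    spans = []
    start = None
    pos = handle.tell()
    for line in iter(handle.readline, b""):
        if line.startswith(marker):
            if start is not None:
                spans.append((start, pos - start))
            start = pos
        pos += len(line)
    if start is not None:
        spans.append((start, pos - start))
    return spans


def load_record(handle, span):
    """Make one record memory-resident; requires a seekable handle."""
    offset, length = span
    handle.seek(offset)
    return handle.read(length)


def hit_names(blob):
    """The actual parsing: a regex over an in-memory blob."""
    return re.findall(rb"^hit\s+(\S+)", blob, re.MULTILINE)


if __name__ == "__main__":
    handle = io.BytesIO(b"#RESULT q1\nhit A\nhit B\n#RESULT q2\nhit C\n")
    spans = index_records(handle)
    print(hit_names(load_record(handle, spans[1])))
```

Note that the index holds one pair of integers per record, not one per field; indexing every parsable bit is the over-fine bookkeeping warned against above.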
-Aaron