[Bioperl-l] SearchIO speed up
aaron.j.mackey at gsk.com
Mon Aug 14 15:56:01 UTC 2006
> User requests report-statistic Y, which is found on the last line of the
> report. We want to avoid reading, storing and parsing the entire file
> just to find Y, so we seek to the last line, parse Y out and return it.
> Yay, super fast.
This was the bit I was missing, thanks; to be honest, I never knew we had
a get_result(Y) method, I thought we only had next_result() iterators. Oh
wait, we don't, but you're proposing we should extend the API to offer
one?
The only thing we do have is a "result_count" method that is defined as
returning the number of results that "have been parsed" (which, to me,
could differ from the number of results that "have already been, or could
yet be, parsed").
> Now the user requests the next_result(). Let's say the first result
> begins 5 lines into the file after the header. We quickly seek() there
> and...
Yes, I understand that pipes aren't seekable. I didn't understand the
non-streaming context in which you wanted to seek back up the stream.
> # Allow seeking around. This adds an initial, possibly trivial, burden
> for piped input only.
OK, if you insist on the need for "get_result(Y)" functionality, then (as
you say) you must use a buffer/cache mechanism (switching from in-memory
to tempfile above some threshold is another wrinkle to consider). But,
consider emulating XML::Twig's "purge_up_to" mechanism, whereby after I
call "get_result(Y)", I can also call "purge_upto(Y)" to release/minimize
the buffer contents.
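
The buffer-plus-purge idea can be sketched roughly as follows. This is an illustration in Python rather than BioPerl code, and every name in it (BufferedResults, get_result, purge_upto) is hypothetical; in particular it assumes results are addressed by integer position and that purge_upto(i) releases everything before result i but keeps i itself.

```python
from typing import Dict, Iterable, Iterator


class BufferedResults:
    """Random access over a forward-only stream of results, with explicit purging.

    Hypothetical sketch: results are plain strings addressed by position.
    """

    def __init__(self, results: Iterable[str]) -> None:
        self._source: Iterator[str] = iter(results)
        self._cache: Dict[int, str] = {}
        self._next_index = 0    # index of the next result to read from the source
        self._purged_below = 0  # results with a smaller index have been released

    def get_result(self, index: int) -> str:
        """Return result `index`, reading (and caching) forward as needed."""
        if index < self._purged_below:
            raise LookupError(f"result {index} was purged; the stream cannot rewind")
        while self._next_index <= index:
            try:
                item = next(self._source)
            except StopIteration:
                raise IndexError(
                    f"stream holds only {self._next_index} results"
                ) from None
            self._cache[self._next_index] = item
            self._next_index += 1
        return self._cache[index]

    def purge_upto(self, index: int) -> None:
        """Release every cached result with an index below `index`."""
        limit = min(index, self._next_index)  # cannot purge what was never read
        for i in range(self._purged_below, limit):
            self._cache.pop(i, None)
        self._purged_below = max(self._purged_below, limit)
```

After get_result(2) the buffer holds results 0 through 2; a following purge_upto(2) shrinks it to result 2 alone, so memory use is bounded by how far back the caller still wants to look.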
The reason I'm being so fussy about this is that a primary motivation for
a shockingly-fast parser is shockingly large datasets that we keep only as
compressed files, uncompressing them en route to the parser; thus your
simple "I'll just copy the stream to tempfile and proceed as normal"
solution is not so trivial.
Here's a compromise: assume that users won't need random access to their
results, only sequential; also, provide a new parameter to the SearchIO
constructor to specify the desired access mode as random; then, if the
input stream is not seekable (which is testable), you can perform your
memory/file caching. If get_result(X) is called without the access mode
being set to random on an unseekable stream, throw an (informative) error.
Yes, I realize this is a bit more work; but the result could actually be
usable!
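
A minimal sketch of that compromise, again in Python for illustration only. The class name ResultReader, the access parameter, and spool_limit are all assumptions, and each line of input stands in for one "result". The standard library's tempfile.SpooledTemporaryFile is used for the in-memory-then-tempfile switch mentioned earlier.

```python
import io
import shutil
import tempfile
from typing import BinaryIO


class ResultReader:
    """Sequential access by default; random access only when possible or requested."""

    def __init__(self, stream: BinaryIO, access: str = "sequential",
                 spool_limit: int = 1 << 20) -> None:
        if access not in ("sequential", "random"):
            raise ValueError(f"unknown access mode: {access!r}")
        self._random_ok = stream.seekable()  # pipes report False here
        if access == "random" and not self._random_ok:
            # Cache the whole input: held in memory up to spool_limit bytes,
            # then transparently rolled over to a temporary file on disk.
            spool = tempfile.SpooledTemporaryFile(max_size=spool_limit)
            shutil.copyfileobj(stream, spool)
            spool.seek(0)
            stream = spool
            self._random_ok = True
        self._stream = stream

    def next_result(self) -> bytes:
        """Return the next result in order (empty bytes at end of input)."""
        return self._stream.readline()

    def get_result(self, index: int) -> bytes:
        """Return result `index` without disturbing the sequential position."""
        if not self._random_ok:
            raise io.UnsupportedOperation(
                "get_result() needs a seekable input; "
                "construct the reader with access='random' to enable caching"
            )
        pos = self._stream.tell()
        try:
            self._stream.seek(0)
            for _ in range(index):
                self._stream.readline()  # skip earlier results
            line = self._stream.readline()
            if not line:
                raise IndexError(f"no result at index {index}")
            return line
        finally:
            self._stream.seek(pos)  # restore the next_result() position
```

Note that the random-access branch still copies the entire stream up front, which is exactly the cost being objected to for large compressed inputs; the point of the parameter is that only callers who ask for random access pay it.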
-Aaron