[Bioperl-l] SearchIO speed up
Sendu Bala
bix at sendu.me.uk
Mon Aug 14 14:34:02 UTC 2006
aaron.j.mackey at gsk.com wrote:
> I'm failing to understand, sorry.
>
> The UNIX utility "more" (or "less" if you prefer) is a pull parser; it
> reads the stream as much as it needs to satisfy the current iteration (the
> next iteration occurring when the user asks for an additional screen or
> line). It does not copy data from a pipe into temp storage.
>
> That said, you can't use "more" to page backwards in piped content (unless
> your "more" is keeping a buffer, which some do).
Exactly, 'more' can work like this because it only ever has to read
chunks in linear file order, and when you want a different order it has
to store everything read in memory. (Which is something we'd like to
avoid doing.)
> So, I agree that you will need some form of storage for the *current*
> information to be parsed (and must process all of the stream necessary to
> obtain all such information), but not for any of the information yet to be
> accessed.
Think of this:
User creates a new SearchIO for a foobar report. Ideally no significant
work is done.
User requests report-statistic Y, which is found on the last line of the
report. We want to avoid reading, storing and parsing the entire file
just to find Y, so we seek to the last line, parse Y out and return it.
Yay, super fast.
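To make the "seek to the last line" step concrete, here is a minimal sketch in Python rather than Perl. It is not taken from any SearchIO code; the function name read_last_line and the block_size parameter are made up for illustration. It assumes the input is a regular file on disk, and reads backwards from the end in blocks until it has seen the whole final line.

```python
import os

def read_last_line(path, block_size=4096):
    """Return the last line of a seekable file without reading all of it.

    Hypothetical helper for illustration only. Works only on real files,
    because it relies on seek().
    """
    with open(path, "rb") as fh:
        fh.seek(0, os.SEEK_END)
        pos = fh.tell()
        data = b""
        while pos > 0:
            # Step backwards one block at a time from the end of the file.
            step = min(block_size, pos)
            pos -= step
            fh.seek(pos)
            data = fh.read(step) + data
            # Once a newline precedes the final line, we have all of it.
            if b"\n" in data.rstrip(b"\r\n"):
                break
        return data.rstrip(b"\r\n").rsplit(b"\n", 1)[-1].decode()
```

For a large report this touches only the last block or two of the file, which is where the speed-up comes from.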
Now the user requests the next_result(). Let's say the first result
begins 5 lines into the file after the header. We quickly seek() there
and...
Oops, our input file was piped so we can't seek.
There are two solutions to the problem:
# Don't allow seeking around, read and cache all data as you pass it in
search of the information you need. This is slower and more memory
hungry than necessary for all parsing cases where the user does not
request 100% of the information in the file.
or
# Allow seeking around. This adds an initial, possibly trivial, burden
for piped input only.
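One possible shape for the second option, again sketched in Python and not describing any existing SearchIO code: check whether the handle can seek, and only when it cannot, copy it once to an anonymous temp file. The name ensure_seekable is hypothetical. The copy is the "initial burden" mentioned above; it has to consume the pipe to its end before anything near the end of the report can be reached.

```python
import shutil
import tempfile

def ensure_seekable(stream):
    """Return a seekable binary handle holding the content of `stream`.

    Hypothetical helper for illustration only. Regular files are returned
    untouched; pipes and other non-seekable input are spooled to disk.
    """
    if stream.seekable():
        return stream  # nothing to do for ordinary files
    # Non-seekable input: read it through once into a temp file that is
    # deleted automatically when the handle is closed.
    spool = tempfile.TemporaryFile()
    shutil.copyfileobj(stream, spool)
    spool.seek(0)
    return spool

# Example use with piped input (binary mode):
#   import sys
#   fh = ensure_seekable(sys.stdin.buffer)
```

Spooling to disk rather than to memory keeps the memory footprint flat, at the cost of one sequential read and write of the piped data.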
I'm going for the latter solution, and my question is, is there some
magical way to avoid reading the whole piped input before we can begin
work? I'm thinking no, but I thought I'd put the question out there in
case someone had dealt with something similar and found a solution.