[Bioperl-l] SearchIO speed up

Sendu Bala bix at sendu.me.uk
Mon Aug 14 14:34:02 UTC 2006


aaron.j.mackey at gsk.com wrote:
> I'm failing to understand, sorry.
> 
> The UNIX utility "more" (or "less" if you prefer) is a pull parser; it 
> reads the stream as much as it needs to satisfy the current iteration (the 
> next iteration occurring when the user asks for an additional screen or 
> line).  It does not copy data from a pipe into temp storage.
> 
> That said, you can't use "more" to page backwards in piped content (unless 
> your "more" is keeping a buffer, which some do).

Exactly, 'more' can work like this because it only ever has to read 
chunks in linear file order; when you want a different order it has to 
store everything it has read in memory. (Which is something we'd like 
to avoid doing.)


> So, I agree that you will need some form of storage for the *current* 
> information to be parsed (and must process all of the stream necessary to 
> obtain all such information), but not for any of the information yet to be 
> accessed.

Think of this:

User creates a new SearchIO for a foobar report. Ideally no significant 
work is done.

User requests report-statistic Y, which is found on the last line of the 
report. We want to avoid reading, storing and parsing the entire file 
just to find Y, so we seek to the last line, parse Y out and return it. 
Yay, super fast.
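
To make that concrete, here's a rough, untested sketch of what I mean 
by 'seek to the last line' (the sub name and the 4k tail size are made 
up for illustration; this isn't existing SearchIO code):

  use Fcntl qw(SEEK_SET SEEK_END);

  # Jump straight to a small tail at the end of a seekable handle and
  # pull the last line out of it, without touching the rest of the file.
  sub last_line {
      my ($fh) = @_;
      seek($fh, -4096, SEEK_END) or seek($fh, 0, SEEK_SET); # tiny file: rewind
      local $/;                    # slurp mode
      my $tail = <$fh>;            # the remaining bytes, at most ~4k
      my @lines = split /\n/, $tail;
      return $lines[-1];
  }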

Now the user calls next_result(). Let's say the first result begins 
5 lines into the file, just after the header. We quickly seek() there 
and...

Oops, our input file was piped so we can't seek.
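
Detecting that situation is at least cheap, though: a zero-offset seek 
is a no-op on a real file but fails on a pipe or socket. Rough sketch 
(the gzip command and filename are just for illustration):

  use Fcntl qw(SEEK_CUR);

  open(my $fh, '-|', 'gzip -dc report.gz') or die "pipe open failed: $!";

  # A zero-offset seek succeeds on a plain file but fails (ESPIPE) on a
  # pipe or socket, so it tells us whether seeking is even an option.
  my $seekable = seek($fh, 0, SEEK_CUR);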

There are two solutions to the problem:
1. Don't allow seeking around; read and cache all data as you pass it 
in search of the information you need. This is slower and more 
memory-hungry than necessary in every case where the user does not 
request 100% of the information in the file.

or

2. Allow seeking around. This adds an initial, possibly trivial, 
burden for piped input only; a rough sketch of what I mean follows.
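
What I have in mind for the piped case is nothing cleverer than 
spooling the input into a temp file the first time we notice we can't 
seek. A rough, untested sketch (the sub name is made up; this isn't 
existing SearchIO code):

  use File::Temp qw(tempfile);
  use Fcntl qw(SEEK_SET SEEK_CUR);

  # If the input handle can't seek, copy it into a temp file and hand
  # back a handle on that instead; real files pass straight through.
  sub seekable_handle {
      my ($in) = @_;
      return $in if seek($in, 0, SEEK_CUR);  # no-op seek: already seekable

      my ($tmp, $tmpname) = tempfile(UNLINK => 1); # removed at exit
      while (my $line = <$in>) {             # the unavoidable up-front read
          print $tmp $line;
      }
      seek($tmp, 0, SEEK_SET);               # rewind, then treat like a file
      return $tmp;
  }

That up-front read is the 'initial, possibly trivial, burden': for 
piped input we end up pulling the whole stream through once before 
doing anything useful.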

I'm going for the latter solution, and my question is: is there some 
magical way to avoid reading the whole piped input before we can begin 
work? I'm thinking no, but I thought I'd put the question out there in 
case someone has dealt with something similar and found a solution.


