[Bioperl-l] SearchIO speed up

Mon Aug 14 18:04:04 UTC 2006

On Aug 14, 2006, at 11:24 AM, Sendu Bala wrote:

> aaron.j.mackey at gsk.com wrote:
>>> User requests report-statistic Y, which is found on the last line of the
>>> report. We want to avoid reading, storing and parsing the entire file
>>> just to find Y, so we seek to the last line, parse Y out and return it.
>>> Yay, super fast.
>>
>> This was the bit I was missing, thanks; to be honest, I never knew we had
>> a get_result(Y) method, I thought we only had next_result() iterators.  Oh
>> wait, we don't, but you're proposing we should extend the API to offer
>> one?
>
> It's subtle. There are no explicit methods defined at the SearchIO level,
> but currently you have to parse data (or not - we want to pull) to find
> out things that all result (or even hit, hsp) objects need. You may need
> to do some internal, optional parsing depending on the specific file
> format variation you discover you are parsing.
>
> And then of course the idea is that this is nested, so the parser for
> the result data is a Bio::Search::Result::ResultI but also a pull-parser
> in its own right (and so on for HitI and HSPI) with a need for
> random-access to the various bits of data needed to answer all the
> various methods of ResultI.
>
>
>> The reason I'm being so fussy about this is that a primary motivation for
>> a shockingly-fast parser is shockingly large datasets that we keep only as
>> compressed files, uncompressing them en route to the parser; thus your
>> simple "I'll just copy the stream to tempfile and proceed as normal"
>> solution is not so trivial.
>
> Right, that's helpful. I'll keep that in mind.
>
>
>> Here's a compromise: assume that users won't need random access to their
>> results, only sequential; also, provide a new parameter to the SearchIO
>> constructor to specify the desired access mode as random; then, if the
>> input stream is not seekable (which is testable), you can perform your
>> memory/file caching.  If get_result(X) is called without the access mode
>> being set to random on an unseekable stream, throw an (informative) error.
>
> I currently have a -piped_behaviour argument that accepts 'memory' or
> 'temp_file'. How about a third (non-default) option of 'linear' to avoid
> any attempt at a seek and just use the data as it is piped? The trouble
> is that you'd need to virtually implement the methods of a parser module
> twice, once where the methods can seek, second where they can't. Or
> maybe not; I'll have to try and see if some sane compromise
> implementation is possible.

My worry: would having both sequential and random access available in
the same module obfuscate or compromise the code?  This is something
you also seem concerned about.  I would focus on getting one
implementation running (whichever is furthest along, which sounds
like 'random access'), knowing that sequential access will be added
at some point.  If it's too hard to fit in sequential access without
compromising your code, then maybe have a separate set of classes
specifically handle sequential access.
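
For illustration only, here is a rough sketch in plain Perl of the
seekability test and the informative error from the compromise quoted
above.  The names is_seekable, plan_access and assert_random_ok, and the
strategy strings they return, are made up for this example and are not
part of SearchIO.

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_CUR);

# Hypothetical helper: true if the handle supports seek().
# A no-op seek (offset 0 from the current position) fails on pipes.
sub is_seekable {
    my ($fh) = @_;
    return seek($fh, 0, SEEK_CUR) ? 1 : 0;
}

# Hypothetical planner: pick a strategy from the requested access mode.
#   'stream' - sequential access, read the data as it arrives
#   'seek'   - random access on a seekable handle
#   'cache'  - random access on an unseekable handle, so the data must
#              first be copied to memory or to a temp file
sub plan_access {
    my ($fh, $access) = @_;
    $access = 'sequential' unless defined $access;
    die "Unknown access mode '$access'\n"
        unless $access eq 'sequential' || $access eq 'random';
    return 'stream' if $access eq 'sequential';
    return is_seekable($fh) ? 'seek' : 'cache';
}

# Hypothetical guard for a get_result(X)-style call: dies with an
# informative message if random access was not requested up front and
# the stream cannot seek.
sub assert_random_ok {
    my ($fh, $access) = @_;
    $access = 'sequential' unless defined $access;
    return 1 if $access eq 'random' || is_seekable($fh);
    die "Cannot fetch a result by position from an unseekable stream; "
      . "request access mode 'random' when constructing the parser\n";
}
```

Keeping the decision in one small function like this might be one way
to stop the seek/no-seek distinction from leaking into every parsing
method, but whether that holds up in a real parser is untested.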

Once the basic code is out anyone interested can test it out; then we  
can offer suggestions, add code, etc.

One suggestion: Bio::DB::WebDBSeqI-implementing classes have a  
parameter, retrieval_type(), for setting how the data stream is  
processed from a server (io_string, tempfile, pipeline).  You could  
have a similar get/set with expanded arguments (using parameters if  
you want) based on the input stream (tempfile, piped) and how you  
want to process it (random, sequential).

$parser->retrieval_type( -stream => 'tempfile', -access => 'random');  # or similar

The options could be sorted out in the method using _rearrange(),  
which adds some flexibility. Of course, you wouldn't need 'access'  
parameter if you split these into two classes.
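
As a rough, dependency-free sketch of what such a get/set might look
like: the class name My::PullParser, the defaults and the allowed values
below are assumptions for illustration.  In real BioPerl code the
argument sorting would be done by _rearrange() rather than the small
stand-in loop used here.

```perl
package My::PullParser;    # hypothetical class name
use strict;
use warnings;

# Allowed values per parameter (assumed for this example).
my %ALLOWED = (
    stream => { map { $_ => 1 } qw(tempfile piped) },
    access => { map { $_ => 1 } qw(random sequential) },
);

sub new {
    my ($class) = @_;
    # Assumed defaults, not taken from any existing module.
    return bless { stream => 'tempfile', access => 'random' }, $class;
}

# Combined getter/setter.  Accepts -stream and/or -access (any case),
# validates everything before changing state, and returns the current
# (stream, access) pair.
sub retrieval_type {
    my ($self, %args) = @_;
    my %new;
    for my $key (keys %args) {
        (my $name = lc $key) =~ s/^-//;
        die "Unknown parameter '$key'\n" unless $ALLOWED{$name};
        my $value = $args{$key};
        die "Invalid value '$value' for '$key'\n"
            unless defined $value && $ALLOWED{$name}{$value};
        $new{$name} = $value;
    }
    @{$self}{ keys %new } = values %new;
    return ( $self->{stream}, $self->{access} );
}

package main;

my $parser = My::PullParser->new;
$parser->retrieval_type( -stream => 'piped', -access => 'sequential' );
my ( $stream, $access ) = $parser->retrieval_type;    # plain get
```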

Another thing also to keep in mind is interoperability.  There are  
more BioPerl Windows users now than in previous years (I was one but  
now I'm Mac-tified).  I don't think it will be a problem except with  
piping/forking (and that's only 'maybe') but you never know! If  
anything throws a wrench into the works it'll be DOS/Windows.

Once you have some test code committed I'll try it out on WinXP.  Mac  
OS X shouldn't be a problem but I'll try it there as well.

No pressure Sendu!

Chris

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign