[Bioperl-l] SearchIO speed up
Chris Fields
cjfields at uiuc.edu
Mon Aug 14 18:04:04 UTC 2006
On Aug 14, 2006, at 11:24 AM, Sendu Bala wrote:
> aaron.j.mackey at gsk.com wrote:
>>> User requests report-statistic Y, which is found on the last line
>>> of the report. We want to avoid reading, storing and parsing the
>>> entire file just to find Y, so we seek to the last line, parse Y
>>> out and return it. Yay, super fast.
>>
>> This was the bit I was missing, thanks; to be honest, I never knew
>> we had a get_result(Y) method, I thought we only had next_result()
>> iterators. Oh wait, we don't, but you're proposing we should extend
>> the API to offer one?
>
> It's subtle. There are no explicit methods defined at the SearchIO
> level, but currently you have to parse data (or not - we want to
> pull) to find out things that all result (or even hit, hsp) objects
> need. You may need to do some internal, optional parsing depending on
> the specific file format variation you discover you are parsing.
>
> And then of course the idea is that this is nested, so the parser for
> the result data is a Bio::Search::Result::ResultI but also a
> pull-parser in its own right (and so on for HitI and HSPI) with a
> need for random-access to the various bits of data needed to answer
> all the various methods of ResultI.
>
>
>> The reason I'm being so fussy about this is that a primary
>> motivation for a shockingly-fast parser is shockingly large datasets
>> that we keep only as compressed files, uncompressing them en route
>> to the parser; thus your simple "I'll just copy the stream to
>> tempfile and proceed as normal" solution is not so trivial.
>
> Right, that's helpful. I'll keep that in mind.
>
>
>> Here's a compromise: assume that users won't need random access to
>> their results, only sequential; also, provide a new parameter to the
>> SearchIO constructor to specify the desired access mode as random;
>> then, if the input stream is not seekable (which is testable), you
>> can perform your memory/file caching. If get_result(X) is called
>> without the access mode being set to random on an unseekable stream,
>> throw an (informative) error.
>
> I currently have a -piped_behaviour argument that accepts 'memory' or
> 'temp_file'. How about a third (non-default) option of 'linear' to
> avoid any attempt at a seek and just use the data as it is piped? The
> trouble is that you'd need to virtually implement the methods of a
> parser module twice, once where the methods can seek, second where
> they can't. Or maybe not; I'll have to try and see if some sane
> compromise implementation is possible.
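
On the "which is testable" point above, here is a minimal sketch of how a
filehandle could be checked for seekability in plain Perl. The helper name
is_seekable() is hypothetical (it is not part of BioPerl), and the approach
simply assumes that a zero-byte relative seek() fails on pipes and succeeds
on regular files.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: returns 1 if the handle supports seek(), else 0.
# Seeking 0 bytes relative to the current position (whence = 1) leaves
# the file pointer where it is, so the check does not disturb reading.
sub is_seekable {
    my ($fh) = @_;
    return seek($fh, 0, 1) ? 1 : 0;
}

# A regular (temporary) file and an anonymous pipe to compare against.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
pipe(my $reader, my $writer) or die "pipe failed: $!";

printf "tempfile: %s\n", is_seekable($tmp_fh) ? 'seekable' : 'not seekable';
printf "pipe:     %s\n", is_seekable($reader) ? 'seekable' : 'not seekable';
```

A parser could run a check like this once at construction time and only fall
back to memory or tempfile caching when it fails.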

My worry: would having both sequential and random access available in
the same module obfuscate or compromise the code? This is something
you also seem concerned about. I would focus on getting one
implementation running (whichever is furthest along, which sounds
like 'random access'), knowing that sequential access will be added
at some point. If it's too hard to fit in sequential access without
compromising your code, then maybe have a separate set of classes
specifically handle sequential access.

Once the basic code is out, anyone interested can test it; then we
can offer suggestions, add code, etc.

One suggestion: Bio::DB::WebDBSeqI-implementing classes have a
parameter, retrieval_type(), for setting how the data stream is
processed from a server (io_string, tempfile, pipeline). You could
have a similar get/set with expanded arguments (using parameters if
you want) based on the input stream (tempfile, piped) and how you
want to process it (random, sequential).

$parser->retrieval_type( -stream => 'tempfile', -access => 'random');
# or similar

The options could be sorted out in the method using _rearrange(),
which adds some flexibility. Of course, you wouldn't need the 'access'
parameter if you split these into two classes.
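
To make the suggestion a bit more concrete, here is a rough sketch of such a
get/set. Everything in it is an assumption for illustration: the class name
My::PullParser, the allowed values and the defaults are made up, and the
named parameters are sorted out by hand rather than with _rearrange() so that
the snippet runs without BioPerl installed.

```perl
package My::PullParser;    # hypothetical class, not part of BioPerl
use strict;
use warnings;
use Carp qw(croak);

# Allowed values per parameter (assumed for this sketch).
my %ALLOWED = (
    stream => { map { $_ => 1 } qw(tempfile memory piped) },
    access => { map { $_ => 1 } qw(random sequential) },
);

sub new {
    my ($class) = @_;
    # Defaults are assumptions, chosen only for the example.
    return bless { stream => 'tempfile', access => 'random' }, $class;
}

# Combined get/set. With no arguments it returns the current settings;
# with -stream and/or -access it validates everything, then updates.
sub retrieval_type {
    my ($self, %args) = @_;
    my %clean;
    for my $key (keys %args) {
        (my $name = lc $key) =~ s/^-//;    # accept '-stream' or 'stream'
        my $value = $args{$key} // '';
        croak "Unknown parameter '$key'" unless $ALLOWED{$name};
        croak "Invalid value '$value' for '$key'"
            unless $ALLOWED{$name}{$value};
        $clean{$name} = $value;
    }
    # Only store the new values once all of them have passed validation.
    $self->{$_} = $clean{$_} for keys %clean;
    return ( $self->{stream}, $self->{access} );
}

package main;

my $parser = My::PullParser->new;
$parser->retrieval_type( -stream => 'piped', -access => 'sequential' );
my ( $stream, $access ) = $parser->retrieval_type;
print "stream=$stream access=$access\n";
```

In a real Bio::Root::Root subclass, _rearrange() would replace the manual
loop and $self->throw() would replace croak().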

Another thing to keep in mind is interoperability. There are more
BioPerl Windows users now than in previous years (I was one, but now
I'm Mac-tified). I don't think it will be a problem except with
piping/forking (and that's only a 'maybe'), but you never know! If
anything throws a wrench into the works, it'll be DOS/Windows.

Once you have some test code committed I'll try it out on WinXP. Mac
OS X shouldn't be a problem, but I'll try it there as well.

No pressure, Sendu!

Chris
Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign