[Bioperl-l] sequence filtering

Hilmar Lapp hlapp@gnf.org
Wed, 9 Oct 2002 18:53:51 -0700


I implemented this now as Bio::Seq::SeqBuilder (interface in 
Bio::Factory::ObjectBuilderI). You can configure quite a few things 
quite easily. There's also a test in t/SeqBuilder.t that may give an 
idea of what you can do. I adopted this capability in the genbank 
parser (swiss + embl yet to follow).

I did some timing with a 2000 sequence chunk of RefSeq human mRNAs:

oldparser: 40 wallclock secs (36.52 usr +  0.00 sys = 36.52 CPU)
fullrecords: 40 wallclock secs (37.65 usr +  0.00 sys = 37.65 CPU)
fullbutshortonly: 25 wallclock secs (24.30 usr +  0.00 sys = 24.30 CPU)
headerandseq: 12 wallclock secs (11.62 usr +  0.00 sys = 11.62 CPU)
headerandseqbutshortonly: 11 wallclock secs ( 9.40 usr +  0.00 
sys =  9.40 CPU)
headeronly: 11 wallclock secs ( 9.50 usr +  0.00 sys =  9.50 CPU)

<footnote>
Header means no annotation, no features, no sequence, no species. 
Short only means skip to next seq if length >= 2500bp, resulting in 
1288 instead of 2000 returned seqs. Oldparser is the code before 
adding these capabilities (v1.64).
</footnote>

Conclusions:

1) the speed penalty to pay for querying $builder->want_slot() and 
$builder->want_object() repeatedly for each entry is negligible 
(~2.5%).

2) the speed-up achievable by not parsing the content for certain 
slots like features is quite significant: at least ~4 fold. (Note 
that RefSeq mRNA entries don't contain long feature tables nor very 
long sequences, and hence the speed-up may be bigger for other types 
of entries.)

3) Skipping entries early (the 'shortonly' timed example kicks in 
after the LOCUS line) reduces the parsing time proportionally to the 
percentage of entries skipped, if the kept entries are parsed in 
full. If the kept entries are not parsed in full (at least not the 
feature table), skipping a significant percentage may only slightly 
decrease overall time.

4) Since I doubt that it's species, references, and comment parsing 
that makes the big difference between timings for 'fullrecord' and 
'headerandseq', I suspect it's the feature tables that cost us ~60% 
or more of the total parsing time (and the RefSeq FTs aren't very 
special).

I wonder how much time we could save by replacing some regexps by 
hard-coded (sub)string matches. Maybe not much, but maybe some.

I'll commit the stuff as soon as I can reach the dev server again. 
Any feedback welcomed.

	-hilmar

On Wednesday, October 9, 2002, at 12:13 AM, Ewan Birney wrote:

>
>
> On Tue, 8 Oct 2002, Hilmar Lapp wrote:
>
>> Apparently (unfortunately) it didn't ring a lot of bells for many 
>> people.
>> I'm still looking forward how Biojava does this exactly, even 
>> though I've
>> now looked through some of their interfaces.
>>
>> It seems to me what they do is not terribly different from the 
>> following
>> interface Bio::Factory::ObjectBuilderI that I propose as a 
>> solution. There
>> would be an implementation Bio::Seq::SeqBuilder.
>
> I think this is a pretty sane process - in some way the "builder"
> configuration object has to expose what it wants (and what it wants 
> is set
> by the user), with things like
>
> (in the middle of SeqIO)
>
>     if( $seqbuilder->want_slot('features') ) {
>       # feature table parsing
>     }
>
>
> I think if you do it in an event based way, you insert a filter 
> into the
> EventStream, ie, there is an interface:
>
>      SeqIOEventGeneratorI
>
> which the parsers comply to, producing Event::SeqIO objects of pre
> digested chunks of data, but not objectfied or fully parsed.
>
>
> One then puts in something with Is-A SeqIOEventGeneratorI and has-a
> SeqIOEventGeneratorI called
>
>     SeqIO::Event::Filter
>
>       methods - keep_event, discard_event
>
>
> which will discard "non relevant events"
>
>
>
> I don't know what the best pattern is here - the event pipeline is 
> nice,
> but the low level parser has to parse the whole file and because of
> formatting differences, is probably going to have a significant 
> overhead
> so maybe your slot method is better...
>
>
>
>>
>> =head2 want_slot
>>
>>  Title   : want_slot
>>  Usage   :
>>  Function: Whether or not the object builder wants to populate the
>>            specified slot of the object to be built.
>>
>>            The slot can be specified either as the name of the
>>            respective method, or the initialization parameter that
>>            would be otherwise passed to new() of the object to be
>>            built.
>>
>>  Example :
>>  Returns : TRUE if the object builder wants to populate the slot, and
>>            FALSE otherwise.
>>  Args    : the name of the slot (a string)
>>
>>
>> =cut
>>
>> =head2 add_slot_value
>>
>>  Title   : add_slot_value
>>  Usage   :
>>  Function: Adds one or more values to the specified slot of the object
>>            to be built.
>>
>>            Naming the slot is the same as for want_slot().
>>
>>            The object builder may further filter the content to be
>>            set, or even completely ignore the request.
>>
>>            If this method reports failure, the caller should not add
>>            more values to the same slot. In addition, the caller may
>>            find it appropriate to abandon the object being built
>>            altogether.
>>
>>  Example :
>>  Returns : TRUE on success, and FALSE otherwise
>>  Args    : the name of the slot (a string)
>>            parameters determining the value to be set
>>
>>
>> =cut
>>
>> =head2 want_object
>>
>>  Title   : want_object
>>  Usage   :
>>  Function: Whether or not the object builder is still interested in
>>            continuing with the object being built.
>>
>>            If this method returns FALSE, the caller should not add any
>>            more values to slots, or otherwise risks that the builder
>>            throws an exception. In addition, make_object() is likely
>>            to return undef after this method returned FALSE.
>>
>>  Example :
>>  Returns : TRUE if the object builder wants to continue building
>>            the present object, and FALSE otherwise.
>>  Args    : none
>>
>>
>> =cut
>>
>> =head2 make_object
>>
>>  Title   : make_object
>>  Usage   :
>>  Function: Get the built object.
>>
>>            This method is allowed to return undef if no value has ever
>>            been added since the last call to make_object(), or if
>>            want_object() returned FALSE (or would have returned FALSE)
>>            before calling this method.
>>
>>  Example :
>>  Returns : the object that was built
>>  Args    : none
>>
>>
>> =cut
>>
>> What do people think?
>>
>> 	-hilmar
>> --
>> -------------------------------------------------------------
>> Hilmar Lapp                            email: lapp@gnf.org
>> GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
>> -------------------------------------------------------------
>>
>> On Tue, 8 Oct 2002, Hilmar Lapp wrote:
>>
>>> I'm trying to pull the daily full RefSeq cumulative update 
>>> through bioperl. Before even getting my hands dirty, I realized 
>>> that this can't work because there are full chromosomes in there, 
>>> and their sequences will choke perl. OTOH, I'm not interested in 
>>> those anyway and ideally I can just skip over sequences some 
>>> property of which match some pattern.
>>>
>>> Like always, there is more than one way to make this work, and 
>>> I'm wondering what could be the (subjectively :) 'best' way in 
>>> the absence of event-based parsing. Some options that crossed my 
>>> mind:
>>>
>>> a) pass an optional additional parameter to next_seq() which is a 
>>> closure returning TRUE if the entry is to be parsed and returned 
>>> and FALSE otherwise. For this option the questions would be, when 
>>> to call this function (every line, every 'item', before feature 
>>> table, before sequence, any combination of those?), and what to 
>>> pass to the closure as argument (a hash map with properties? an 
>>> instantiated Bio::SeqI object? the current line? the current slot 
>>> that was parsed and its value? something else?).
>>>
>>> b) create a SeqFilterI interface and pass an object implementing 
>>> it. This is really just a more OO-form of a) and the same kind of 
>>> questions need to be answered.
>>>
>>> c) sending events to an event listener, and skipping over the 
>>> sequence if any of the listeners returns FALSE (i.e., join by 
>>> AND). This is again very similar to a) but more flexible but also 
>>> more heavy-weight (more method calls). Again, similar kinds of 
>>> questions would need to be answered in order to define 
>>> SeqParseEventI or a similar interface.
>>>
>>> I'd be glad to hear anyone's thoughts on this. Also, I'm sure 
>>> there are better ways. If you know one, I'd be glad to learn.
>>>
>>> My preference goes for simplicity, and so far I don't think a) is 
>>> that bad, although it does lack some flexibility.
>>>
>>> 	-hilmar
>>>
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l@bioperl.org
>> http://bioperl.org/mailman/listinfo/bioperl-l
>>
>
>
--
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------