[Bioperl-l] location parsing refactored

Hilmar Lapp hlapp@gnf.org
Sun, 11 Aug 2002 21:53:52 -0700


I introduced Bio::Factory::LocationFactoryI with currently one 
method: from_string(), to return a Bio::LocationI object.

Also, I rewrote the feature table location string parsing (the old 
code in FTHelper.pm is so hideous that understanding it would have 
taken the same time as rewriting) in the new module 
Bio::Factory::FTLocationFactory, which implements the above 
interface. I also added extensive testing in t/LocationFactory.t, 
which basically goes through all examples of the Genbank FT 
documentation. Tests include a test whether rendered location string 
is equal to original location string.

All those tests pass. To achieve that I had to fix a couple of 
problems in the LocationI implementations, too.

Note that a join() on the complementary strand will always be 
rendered as join(complement(...),complement(...),...), instead of 
the equivalent complement(join(...)).

My rewrite does not contain 'short cuts' to object initialization 
anymore that directly mess with the hash ref. If someone needs a 
speed-up I'm sure there are ways to achieve this other than to 
create time bombs.

Next thing to do is integrate this new framework with 
SeqIO/FTHelper. Once this is done, people need to bang on it 
extensively to see where it breaks (I'm sure I haven't covered every 
weird location string in Genbank and swissprot). If you have a 
trouble-maker at hand, just add it to the tests in 
t/LocationFactory.t (it's near the top, shouldn't be hard to figure 
out; otherwise just send it to me).

	-hilmar

On Friday, August 9, 2002, at 10:54  AM, Jason Stajich wrote:

>
> To push this onto the subfeatures as you suggest is going to take a 
> fair
> amount of refactoring in the current parsing code but would probably be
> the best idea.
>
> No one has been brave enough to go in there and mess with things very
> much.  A full refactor is okay with me but we sort of already did that
> when we moved to locations/seqfeature separation.  We also have the
> problem that we support hierarchical features AND hierarchical 
> locations.
> At the BoF we discussed describing coding conventions to insure that
> people follow a convention that works.  I'm still unclear what the 
> right
> path ahead should be.
>
>
> I'd suggest that 1st we derive a set of test cases which break the
> expected semantics, put these in a new test file or as part of 
> t/SeqIO.t
> show that the parser currently does the wrong thing and then set about
> trying to fix it.  This should also test that the expected DNA is 
> returned
> from all of these cases as well.  If we have a test system in place 
> that
> does this properly we'll have a much better time tracking down 
> errors and
> being consistent.
>
> I think cases like:
>
> complement(join(1..200,205..300),complement(500..600))
> join(complement(1..200),205..300,complement(500..600))
>
> need to be properly tested
>
>
>
> On Thu, 8 Aug 2002, Hilmar Lapp wrote:
>
>>
>>
>>> -----Original Message-----
>>> From: Ewan Birney [mailto:birney@ebi.ac.uk]
>>> Sent: Thursday, August 08, 2002 8:30 AM
>>> To: Chris Mungall
>>> Cc: Hilmar Lapp; Elia Stupka; Jason Stajich; bioperl-l@bioperl.org
>>> Subject: Re: [Bioperl-l] *major* error in genbank parser or am i just
>>> insane?
>>>
>> [...]
>>>
>>> I do prefer chris' semantics to having to hold onto the
>>> difference between
>>> a parent complement and a child complement - ie, I think we should
>>> implicitly only allow the complement to happen on simple sequence
>>> locations and never splits, and genbank with an outer complement 
>>> is an
>>> implicit distributive complement and reverse of its components.
>>>
>>
>> OK. So this is a vote for sublocs on strand -1 and splitloc 
>> strandless, right?
>>
>> Even though this differs from the present implementation, I 
>> actually think too this is saner. So my vote goes here too. Jason? 
>> (I know we decided otherwise in Canada :o)
>>
>> 	-hilmar
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l@bioperl.org
>> http://bioperl.org/mailman/listinfo/bioperl-l
>>
>
> --
> Jason Stajich
> Duke University
> jason at cgt.mc.duke.edu
>
>
>
>
--
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------