[Bioperl-l] location parsing refactored
Hilmar Lapp
hlapp@gnf.org
Sun, 11 Aug 2002 21:53:52 -0700
I introduced Bio::Factory::LocationFactoryI, which currently has a
single method, from_string(), that returns a Bio::LocationI object.
Also, I rewrote the feature table location string parsing (the old
code in FTHelper.pm is so hideous that understanding it would have
taken the same time as rewriting) in the new module
Bio::Factory::FTLocationFactory, which implements the above
interface. I also added extensive testing in t/LocationFactory.t,
which basically goes through all examples of the GenBank FT
documentation. The tests include a round-trip check that the rendered
location string is identical to the original location string.
All those tests pass. To achieve that I had to fix a couple of
problems in the LocationI implementations, too.
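
The round-trip check described above (parse a location string, render
it again, compare with the input) can be sketched as follows. This is
a minimal Python illustration, not the code in
Bio::Factory::FTLocationFactory or t/LocationFactory.t: the names
parse(), render() and split_top_level() are hypothetical, and only
plain positions, a..b ranges, complement(), join() and order() on
well-formed input are handled.

```python
import re


def split_top_level(text):
    """Split on commas that are not nested inside parentheses."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(text):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:
            parts.append(text[start:i])
            start = i + 1
    parts.append(text[start:])
    return parts


def parse(loc):
    """Parse a (well-formed) location string into a nested tuple tree."""
    m = re.fullmatch(r"(complement|join|order)\((.*)\)", loc)
    if m:
        op, inner = m.groups()
        return (op, [parse(part) for part in split_top_level(inner)])
    m = re.fullmatch(r"(\d+)\.\.(\d+)", loc)
    if m:
        return ("range", int(m.group(1)), int(m.group(2)))
    if re.fullmatch(r"\d+", loc):
        return ("point", int(loc))
    raise ValueError(f"unsupported location string: {loc!r}")


def render(node):
    """Render a tree produced by parse() back into a location string."""
    kind = node[0]
    if kind == "point":
        return str(node[1])
    if kind == "range":
        return f"{node[1]}..{node[2]}"
    return f"{kind}({','.join(render(child) for child in node[1])})"


# Hypothetical sample strings in the style of the feature table examples.
EXAMPLES = [
    "467",
    "340..565",
    "complement(34..126)",
    "join(12..78,134..202)",
    "join(complement(4918..5163),complement(2691..4571))",
]

for original in EXAMPLES:
    # The round trip: rendering the parsed tree must reproduce the input.
    assert render(parse(original)) == original, original
```

Adding a trouble-maker to such a test amounts to appending one more
string to the list of examples.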
Note that a join() on the complementary strand will always be
rendered as join(complement(...),complement(...),...), instead of
the equivalent complement(join(...)).
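
To illustrate that rendering rule, here is a small Python sketch that
pushes an outer complement() down onto the individual ranges. It is
not the BioPerl implementation; distribute_complement() and render()
are hypothetical names. Reversing the order of the sub-locations is an
assumption taken from the quoted discussion below ("distributive
complement and reverse of its components"); this message does not say
which order the new code emits.

```python
def distribute_complement(node):
    """Rewrite a location tree so complement() only wraps simple ranges."""
    kind = node[0]
    if kind == "range":
        return node
    if kind == "join":
        return ("join", [distribute_complement(child) for child in node[1]])
    # kind == "complement": normalise the operand first
    inner = distribute_complement(node[1])
    if inner[0] == "range":
        return ("complement", inner)
    if inner[0] == "complement":
        return inner[1]  # a double complement cancels out
    # complement of a join: complement every part, in reversed order
    # (the reversal is the assumption named in the lead-in)
    return ("join", [distribute_complement(("complement", child))
                     for child in reversed(inner[1])])


def render(node):
    """Render a location tree as a feature-table style string."""
    kind = node[0]
    if kind == "range":
        return f"{node[1]}..{node[2]}"
    if kind == "complement":
        return f"complement({render(node[1])})"
    return "join(" + ",".join(render(child) for child in node[1]) + ")"


outer = ("complement", ("join", [("range", 1, 200), ("range", 205, 300)]))
print(render(outer))
print(render(distribute_complement(outer)))
```

With the reversal assumption, the second line printed is
join(complement(205..300),complement(1..200)).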
My rewrite no longer contains 'shortcuts' for object initialization
that directly mess with the hash ref. If someone needs a speed-up,
I'm sure there are ways to achieve it other than creating time
bombs.
The next thing to do is to integrate this new framework with
SeqIO/FTHelper. Once this is done, people need to bang on it
extensively to see where it breaks (I'm sure I haven't covered every
weird location string in GenBank and Swiss-Prot). If you have a
trouble-maker at hand, just add it to the tests in
t/LocationFactory.t (the list is near the top and shouldn't be hard
to figure out; otherwise just send it to me).
-hilmar
On Friday, August 9, 2002, at 10:54 AM, Jason Stajich wrote:
>
> To push this onto the subfeatures as you suggest is going to take a fair
> amount of refactoring in the current parsing code but would probably be
> the best idea.
>
> No one has been brave enough to go in there and mess with things very
> much. A full refactor is okay with me but we sort of already did that
> when we moved to locations/seqfeature separation. We also have the
> problem that we support hierarchical features AND hierarchical
> locations.
> At the BoF we discussed describing coding conventions to ensure that
> people follow a convention that works. I'm still unclear what the right
> path ahead should be.
>
>
> I'd suggest that 1st we derive a set of test cases which break the
> expected semantics, put these in a new test file or as part of t/SeqIO.t,
> show that the parser currently does the wrong thing, and then set about
> trying to fix it. This should also test that the expected DNA is returned
> from all of these cases as well. If we have a test system in place that
> does this properly we'll have a much better time tracking down errors and
> being consistent.
>
> I think cases like:
>
> complement(join(1..200,205..300),complement(500..600))
> join(complement(1..200),205..300,complement(500..600))
>
> need to be properly tested
>
>
>
> On Thu, 8 Aug 2002, Hilmar Lapp wrote:
>
>>
>>
>>> -----Original Message-----
>>> From: Ewan Birney [mailto:birney@ebi.ac.uk]
>>> Sent: Thursday, August 08, 2002 8:30 AM
>>> To: Chris Mungall
>>> Cc: Hilmar Lapp; Elia Stupka; Jason Stajich; bioperl-l@bioperl.org
>>> Subject: Re: [Bioperl-l] *major* error in genbank parser or am i just
>>> insane?
>>>
>> [...]
>>>
>>> I do prefer chris' semantics to having to hold onto the difference
>>> between a parent complement and a child complement - ie, I think we
>>> should implicitly only allow the complement to happen on simple sequence
>>> locations and never splits, and genbank with an outer complement is an
>>> implicit distributive complement and reverse of its components.
>>>
>>
>> OK. So this is a vote for sublocs on strand -1 and splitloc
>> strandless, right?
>>
>> Even though this differs from the present implementation, I
>> actually think too this is saner. So my vote goes here too. Jason?
>> (I know we decided otherwise in Canada :o)
>>
>> -hilmar
>>
>
> --
> Jason Stajich
> Duke University
> jason at cgt.mc.duke.edu
--
-------------------------------------------------------------
Hilmar Lapp email: lapp at gnf.org
GNF, San Diego, Ca. 92121 phone: +1-858-812-1757
-------------------------------------------------------------