[Bioperl-l] location parsing refactored
Hilmar Lapp
hlapp@gnf.org
Mon, 12 Aug 2002 02:03:01 -0700
I just finished adapting FTHelper and the SeqIO parsers for genbank,
embl, and swissprot to use the new LocationFactoryI framework. You
can now pass in your own location string parser if you want by
calling $seqio_stream->location_factory($mylocationfactory). The
default instance parses Genbank-compatible location strings.
All tests pass. This doesn't necessarily mean that all
Genbank/EMBL/SwissProt locations would be properly parsed. It'd be
great if someone can try it on a large-scale, e.g. entire Swissprot
or Genbank, and collect the trouble-makers. (Ideally, grab the
location string that's causing trouble and add it to the test cases
in t/LocationFactory.t.)
There are more consequences from the refactoring: as decided, the
root split location now doesn't have a strand anymore. If you set
the strand of the root split location, it propagates to all
sublocations that are not remote. $splitloc->strand() returning
undef may pose a problem for feature renderers: you can't treat
features with split locations the same anymore as those with other
location types. Instead, you need to inspect the sublocations of
$feat->location().
Is this really the behaviour we want? We could add convenience code
that returns a strand if all non-remote sublocations have the same.
Any votes on that?
Note also that I simplified the location of the last feature in
testfuzzy.genbank: complement(join(...,complement(...))) is not
really what you'd see in reality I'd think (complement of complement
is plus strand, right? However $loc->strand(-1) doesn't invert the
sign.).
-hilmar
On Sunday, August 11, 2002, at 09:53 PM, Hilmar Lapp wrote:
> I introduced Bio::Factory::LocationFactoryI with currently one
> method: from_string(), to return a Bio::LocationI object.
>
> Also, I rewrote the feature table location string parsing (the old
> code in FTHelper.pm is so hideous that understanding it would have
> taken the same time as rewriting) in the new module
> Bio::Factory::FTLocationFactory, which implements the above
> interface. I also added extensive testing in t/LocationFactory.t,
> which basically goes through all examples of the Genbank FT
> documentation. Tests include a test whether rendered location
> string is equal to original location string.
>
> All those tests pass. To achieve that I had to fix a couple of
> problems in the LocationI implementations, too.
>
> Note that a join() on the complementary strand will always be
> rendered as join(complement(...),complement(...),...), instead of
> the equivalent complement(join(...)).
>
> My rewrite does not contain 'short cuts' to object initialization
> anymore that directly mess with the hash ref. If someone needs a
> speed-up I'm sure there are ways to achieve this other than to
> create time bombs.
>
> Next thing to do is integrate this new framework with
> SeqIO/FTHelper. Once this is done, people need to bang on it
> extensively to see where it breaks (I'm sure I haven't covered
> every weird location string in Genbank and swissprot). If you have
> a trouble-maker at hand, just add it to the tests in
> t/LocationFactory.t (it's near the top, shouldn't be hard to figure
> out; otherwise just send it to me).
>
> -hilmar
>
> On Friday, August 9, 2002, at 10:54 AM, Jason Stajich wrote:
>
>>
>> To push this onto the subfeatures as you suggest is going to take
>> a fair
>> amount of refactoring in the current parsing code but would
>> probably be
>> the best idea.
>>
>> No one has been brave enough to go in there and mess with things very
>> much. A full refactor is okay with me but we sort of already did that
>> when we moved to locations/seqfeature separation. We also have the
>> problem that we support hierarchical features AND hierarchical
>> locations.
>> At the BoF we discussed describing coding conventions to insure that
>> people follow a convention that works. I'm still unclear what the
>> right
>> path ahead should be.
>>
>>
>> I'd suggest that 1st we derive a set of test cases which break the
>> expected semantics, put these in a new test file or as part of
>> t/SeqIO.t
>> show that the parser currently does the wrong thing and then set about
>> trying to fix it. This should also test that the expected DNA is
>> returned
>> from all of these cases as well. If we have a test system in
>> place that
>> does this properly we'll have a much better time tracking down
>> errors and
>> being consistent.
>>
>> I think cases like:
>>
>> complement(join(1..200,205..300),complement(500..600))
>> join(complement(1..200),205..300,complement(500..600))
>>
>> need to be properly tested
>>
>>
>>
>> On Thu, 8 Aug 2002, Hilmar Lapp wrote:
>>
>>>
>>>
>>>> -----Original Message-----
>>>> From: Ewan Birney [mailto:birney@ebi.ac.uk]
>>>> Sent: Thursday, August 08, 2002 8:30 AM
>>>> To: Chris Mungall
>>>> Cc: Hilmar Lapp; Elia Stupka; Jason Stajich; bioperl-l@bioperl.org
>>>> Subject: Re: [Bioperl-l] *major* error in genbank parser or am i
>>>> just
>>>> insane?
>>>>
>>> [...]
>>>>
>>>> I do prefer chris' semantics to having to hold onto the
>>>> difference between
>>>> a parent complement and a child complement - ie, I think we should
>>>> implicitly only allow the complement to happen on simple sequence
>>>> locations and never splits, and genbank with an outer complement
>>>> is an
>>>> implicit distributive complement and reverse of its components.
>>>>
>>>
>>> OK. So this is a vote for sublocs on strand -1 and splitloc
>>> strandless, right?
>>>
>>> Even though this differs from the present implementation, I
>>> actually think too this is saner. So my vote goes here too.
>>> Jason? (I know we decided otherwise in Canada :o)
>>>
>>> -hilmar
>>>
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l@bioperl.org
>>> http://bioperl.org/mailman/listinfo/bioperl-l
>>>
>>
>> --
>> Jason Stajich
>> Duke University
>> jason at cgt.mc.duke.edu
>>
>>
>>
>>
> --
> -------------------------------------------------------------
> Hilmar Lapp email: lapp at gnf.org
> GNF, San Diego, Ca. 92121 phone: +1-858-812-1757
> -------------------------------------------------------------
>
>
--
-------------------------------------------------------------
Hilmar Lapp email: lapp at gnf.org
GNF, San Diego, Ca. 92121 phone: +1-858-812-1757
-------------------------------------------------------------