[Bioperl-l] Refactoring Locations...

Heikki Lehvaslaiho heikki@ebi.ac.uk
Mon, 01 Jul 2002 18:10:12 +0100


In my opinion this is worth considering for perl 6.

	-Heikki

Lincoln Stein wrote:
> I'm going to defend my position, but this will be my last word on the subject 
> (this isn't worth extended discussion or a flame war).
> 
> 0)  Going to space-oriented coordinates makes our code simpler, less buggy, 
> and makes it easier to add new modules.
> 
> 1) If we keep the API the same, then external applications won't need to know 
> we made the change.  The only apps that will break are those that broke 
> encapsulation by going directly to the hash.
> 
> 2) We have to rewrite BioPerl from the ground up next year in any case in 
> order to support perl 6.0.
> 
> Lincoln
> 
> 
> On Saturday 29 June 2002 07:25 am, Ewan Birney wrote:
> 
>>On Fri, 28 Jun 2002, Chris Mungall wrote:
>>
>>>I second this. gadfly works in space-oriented coordinates. you have to be
>>>super-rigorous in import/export but otherwise it's a much better system,
>>>it's ridiculous having to import an awkward fuzzy system for representing
>>>insertions/splice sites etc.
>>>
>>>is it really too late to have us switch to this system? I can't see how
>>>it would be done without extreme pain but I think it'd be worth it in the
>>>end. bioperl2.0?
>>
>>I say no. Really.
>>
>>
>>We have 20 years of legacy in inclusive coordinates. As much as I would
>>love to work in half open coordinates, the number of
>>bugs/misunderstandings and idiocies that will go on is too much.
>>
>>
>>In tight projects (eg Gadfly, my own Wise2 package) where everyone is 100%
>>mind synced, I think one can make the change, and it is much nicer to
>>program in. But in Bioperl, with this loose distribution of people we just
>>can't do it.
>>
>>
>>I vote STRONG no. We stick to what has been published/stored/used for the
>>last 20 years. +1 is not that hard to put in.
>>
>>
>>>On Fri, 28 Jun 2002, Lincoln Stein wrote:
>>>
>>>>The suggested refactoring sounds correct.  I prefer IN-BETWEEN to TWEEN
>>>>or TWIXT.
>>>>
>>>>As a meta comment, life would be much easier if positions were
>>>>described (perhaps internally) as zero-based half open intervals, which
>>>>is the way that all sensible graphics code does it (I first learned the
>>>>concepts working with Apple's QuickDraw).  In half-open intervals, the
>>>>coordinates refer to the spaces between the nucleotides, rather than to
>>>>the nucleotides themselves. For the dinucleotide AG, the following
>>>>mappings hold:
>>>>
>>>>	coordinate		sequence
>>>>
>>>>	(0,1)			A
>>>>	(0,2)			AG
>>>>	(1,1)			space between A & G
>>>>
>>>>Note that in half-open intervals, the length of the sequence is always
>>>>end minus start, and that you can do coordinate arithmetic without
>>>>adding and subtracting 1's.
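
The mapping quoted above can be checked with a minimal sketch. It is written in Python only because Python slices happen to be zero-based and half-open already; it is not BioPerl code, and the helper names `to_half_open` and `interval_length` are hypothetical, introduced here for illustration.

```python
def to_half_open(start, end):
    """Convert a 1-based inclusive range (feature-table style, e.g. 34..55)
    to a 0-based half-open interval. Hypothetical helper for illustration."""
    return start - 1, end


def interval_length(start, end):
    """In half-open coordinates the length is simply end minus start."""
    return end - start


seq = "AG"
# The three coordinates from the table above, applied to the dinucleotide AG.
for start, end in [(0, 1), (0, 2), (1, 1)]:
    print((start, end), repr(seq[start:end]), interval_length(start, end))
# (0, 1) 'A' 1
# (0, 2) 'AG' 2
# (1, 1) '' 0

# 34..55 in inclusive coordinates covers 55 - 34 + 1 = 22 residues;
# the half-open form gives the same length with no +1.
print(interval_length(*to_half_open(34, 55)))  # 22
```

The zero-length interval (1,1) is the "space between A & G" case: it selects no residues but still names a position, which is what an insertion site needs.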
>>>>
>>>>Lincoln
>>>>
>>>>On Thursday 27 June 2002 12:34 pm, Heikki Lehvaslaiho wrote:
>>>>
>>>>>I ran into a small problem with Bio::Locations and would like to
>>>>>slightly refactor them.
>>>>>
>>>>>From my point of view there are three types of exact sequence
>>>>>locations which in feature table notation are: 23, 34..55 and 46^47.
>>>>>The first two are handled by Bio::Location::Simple and have
>>>>>location_type('EXACT'). The last one is lumped into
>>>>>location_type('BETWEEN') together with locations like 46^78 and
>>>>>handled by Bio::Location::Fuzzy. The source for the confusion is that
>>>>>the feature table definition allows for locations like 46^78 which I
>>>>>do not think are used anywhere. To stress, notation 46^47 is
>>>>>essential when you have clean insertions between residues.
>>>>>
>>>>>
>>>>>Currently we have Bio::LocationI which defines the interface,
>>>>>Bio::Location::Simple and two subclasses of Simple:
>>>>>Bio::Location::Fuzzy and Bio::Location::Split.
>>>>>
>>>>>What I'd like to do is rename the current Simple to Atomic, to
>>>>>be a common superclass, and recreate Bio::Location::Simple so that it
>>>>>can have two values for the method location_type(): 'EXACT' and 
>>>>>'IN-BETWEEN' ('TWEEN', 'TWIXT' ?). The object will throw an error if
>>>>>location_type() is 'IN-BETWEEN' and start() and end() are both defined
>>>>>and not adjacent. The length of an 'IN-BETWEEN' location is always
>>>>>zero. The default value of location_type() will be 'EXACT'.
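
The rules proposed in the paragraph above can be sketched roughly as follows. This is a Python stand-in, not the actual Bio::Location::Simple implementation: the class name `SimpleLocation`, its constructor signature, and the choice of `ValueError` are all assumptions made for illustration. Coordinates are 1-based and inclusive, as in the feature table notation.

```python
class SimpleLocation:
    """Hypothetical stand-in for the proposed Bio::Location::Simple.

    Coordinates are 1-based and inclusive (feature-table style).
    """

    VALID_TYPES = ("EXACT", "IN-BETWEEN")

    def __init__(self, start, end=None, location_type="EXACT"):
        if location_type not in self.VALID_TYPES:
            raise ValueError(f"unknown location_type: {location_type!r}")
        # Assumption: a single-position EXACT location such as "23"
        # is stored with end equal to start.
        if end is None and location_type == "EXACT":
            end = start
        # An in-between location such as 46^47 must sit between two
        # adjacent residues; 46^78 is rejected.
        if (location_type == "IN-BETWEEN"
                and start is not None and end is not None
                and end - start != 1):
            raise ValueError("IN-BETWEEN requires adjacent start and end")
        self.start = start
        self.end = end
        self.location_type = location_type

    def length(self):
        """Zero for IN-BETWEEN; inclusive length otherwise."""
        if self.location_type == "IN-BETWEEN":
            return 0
        return self.end - self.start + 1


print(SimpleLocation(23).length())                    # 1
print(SimpleLocation(34, 55).length())                # 22
print(SimpleLocation(46, 47, "IN-BETWEEN").length())  # 0
```

With this shape, calling code that cares about insertions can test either the location type or a zero length, as the proposal suggests.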
>>>>>
>>>>>
>>>>>In practice the code changes seem to be easy to make and there might
>>>>>even be a slight speed increase: the current Simple does some things in
>>>>>a slightly convoluted way because its methods are inherited by Fuzzy
>>>>>and Split. Using Bio::Location::Simple in scripts and other modules is
>>>>>made more complicated only if you are concerned about insertions
>>>>>(you should be!). You can then test either location_type() or
>>>>>length().
>>>>>
>>>>>
>>>>>The only other place in bioperl core outside Bio::Location that I
>>>>>have found to be affected is FTHelper.pm where one more condition
>>>>>needs to be added.
>>>>>
>>>>>
>>>>>I have almost all the code changes ready for committing.
>>>>>
>>>>>Any comments?
>>>>>
>>>>>	-Heikki
>>>>
>>
>>-----------------------------------------------------------------
>>Ewan Birney. Mobile: +44 (0)7970 151230, Work: +44 1223 494420
>><birney@ebi.ac.uk>.
>>-----------------------------------------------------------------
> 
> 


-- 
______ _/      _/_____________________________________________________
       _/      _/                      http://www.ebi.ac.uk/mutations/
      _/  _/  _/  Heikki Lehvaslaiho          heikki@ebi.ac.uk
     _/_/_/_/_/  EMBL Outstation, European Bioinformatics Institute
    _/  _/  _/  Wellcome Trust Genome Campus, Hinxton
   _/  _/  _/  Cambs. CB10 1SD, United Kingdom
      _/      Phone: +44 (0)1223 494 644   FAX: +44 (0)1223 494 468
___ _/_/_/_/_/________________________________________________________