[Bioperl-l] Refactoring Locations...

Lincoln Stein lstein@cshl.org
Mon, 1 Jul 2002 12:44:52 -0400


I'm going to defend my position, but this will be my last word on the subject 
(this isn't worth extended discussion or a flame war).

0)  Going to space-oriented coordinates makes our code simpler, less buggy, 
and makes it easier to add new modules.

1) If we keep the API the same, then external applications won't need to know 
we made the change.  The only apps that will break are those that broke 
encapsulation by going directly to the hash.

2) We have to rewrite BioPerl from the ground up next year in any case in 
order to support Perl 6.0.
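
Points 0 and 1 can be sketched together. This is only a minimal illustration, 
written in Python rather than as BioPerl code; the class SpaceLocation and all 
of its method names are hypothetical and are not part of any Bio::Location 
module. It assumes coordinates are stored internally as zero-based half-open 
intervals while callers keep seeing 1-based inclusive start and end values.

```python
class SpaceLocation:
    """Hypothetical location stored as a zero-based, half-open interval.

    The two stored numbers name the spaces between residues, not the
    residues themselves. This is an illustration only, not a BioPerl API.
    """

    def __init__(self, lo, hi):
        if not 0 <= lo <= hi:
            raise ValueError("need 0 <= lo <= hi")
        self._lo = lo
        self._hi = hi

    @classmethod
    def from_inclusive(cls, start, end):
        # Convert 1-based inclusive coordinates (e.g. 34..55) on the way in.
        return cls(start - 1, end)

    def start(self):
        # 1-based inclusive start, as an external caller would expect.
        return self._lo + 1

    def end(self):
        # The 1-based inclusive end equals the half-open upper bound.
        return self._hi

    def length(self):
        # No +1 correction is needed in space coordinates.
        return self._hi - self._lo

    def subseq(self, seq):
        # Python slices are themselves zero-based and half-open.
        return seq[self._lo:self._hi]


dinucleotide = "AG"
a_only = SpaceLocation(0, 1).subseq(dinucleotide)
both = SpaceLocation(0, 2).subseq(dinucleotide)
gap = SpaceLocation(1, 1)            # the space between A and G
feature = SpaceLocation.from_inclusive(34, 55)
```

In this sketch a caller that only uses start(), end() and length() never sees 
the internal representation. How a zero-length interval such as (1,1) should 
be reported through an inclusive start/end API is left open here.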

Lincoln


On Saturday 29 June 2002 07:25 am, Ewan Birney wrote:
> On Fri, 28 Jun 2002, Chris Mungall wrote:
> > I second this. gadfly works in space-oriented coordinates. you have to be
> > super-rigorous in import/export but otherwise it's a much better system,
> > it's ridiculous having to import an awkward fuzzy system for representing
> > insertions/splice sites etc.
> >
> > is it really too late to have us switch to this system? I can't see how
> > it would be done without extreme pain but I think it'd be worth it in the
> > end. bioperl2.0?
>
> I say no. Really.
>
>
> We have 20 years of legacy in inclusive coordinates. As much as I would
> love to work in half open coordinates, the number of
> bugs/misunderstandings and idiocies that will go on is too much.
>
>
> In tight projects (eg Gadfly, my own Wise2 package) where everyone is 100%
> mind synced, I think one can make the change, and it is much nicer to
> program in. But in Bioperl, with this loose distribution of people we just
> can't do it.
>
>
> I vote STRONG no. We stick to what has been published/stored/used for the
> last 20 years. +1 is not that hard to put in.
>
> > On Fri, 28 Jun 2002, Lincoln Stein wrote:
> > > The suggested refactoring sounds correct.  I prefer IN-BETWEEN to TWEEN
> > > or TWIXT.
> > >
> > > As a meta comment, life would be much easier if positions were
> > > described (perhaps internally) as zero-based half open intervals, which
> > > is the way that all sensible graphics code does it (I first learned the
> > > concepts working with Apple's QuickDraw).  In half-open intervals, the
> > > coordinates refer to the spaces between the nucleotides, rather than to
> > > the nucleotides themselves. For the dinucleotide AG, the following
> > > mappings hold:
> > >
> > > 	coordinate		sequence
> > >
> > > 	(0,1)			A
> > > 	(0,2)			AG
> > > 	(1,1)			space between A & G
> > >
> > > Note that in half-open intervals, the length of the sequence is always
> > > > end minus start, and that you can do coordinate arithmetic without
> > > > adding and subtracting 1's.
> > >
> > > Lincoln
> > >
> > > On Thursday 27 June 2002 12:34 pm, Heikki Lehvaslaiho wrote:
> > > > I ran into a small problem with Bio::Locations and would like to
> > > > slightly refactor them.
> > > >
> > > > > From my point of view there are three types of exact sequence
> > > > locations which in feature table notation are: 23, 34..55 and 46^47.
> > > > The first two are handled by Bio::Location::Simple and have
> > > > location_type('EXACT'). The last one is lumped into
> > > > location_type('BETWEEN') together with locations like 46^78 and
> > > > handled by Bio::Location::Fuzzy. The source for the confusion is that
> > > > the feature table definition allows for locations like 46^78 which I
> > > > do not think are used anywhere. To stress, notation 46^47 is
> > > > essential when you have clean insertions between residues.
> > > >
> > > >
> > > > Currently we have Bio::LocationI which defines the interface,
> > > > Bio::Location::Simple and two subclasses of Simple:
> > > > Bio::Location::Fuzzy and Bio::Location::Split.
> > > >
> > > > What I'd like to have is to rename the current Simple into Atomic to
> > > > be a common superclass and recreate Bio::Location::Simple so that it
> > > > can have two values for the method location_type(): 'EXACT' and 
> > > > 'IN-BETWEEN' ('TWEEN', 'TWIXT' ?). The object will throw an error if
> > > > location_type() is 'TWEEN' and start() and end() are both defined and
> > > > not adjacent. The length of 'TWIXT' location is always zero. The
> > > > default value of location_type() will be 'EXACT'.
> > > >
> > > >
> > > > > In practice the code changes seem to be easy to make and there might
> > > > > even be a slight speed increase: the current Simple does some things
> > > > > in a slightly convoluted way because its methods are inherited by
> > > > > Fuzzy and Split. Using Bio::Location::Simple in scripts and other
> > > > > modules is made more complicated only if you are concerned about
> > > > > insertions (you should be!). You can then test either location_type()
> > > > > or length().
> > > >
> > > >
> > > > The only other place in bioperl core outside Bio::Location that I
> > > > have found to be affected is FTHelper.pm where one more condition
> > > > needs to be added.
> > > >
> > > >
> > > > I have almost all the code changes ready for committing.
> > > >
> > > > Any comments?
> > > >
> > > > 	-Heikki
> >
>
> -----------------------------------------------------------------
> Ewan Birney. Mobile: +44 (0)7970 151230, Work: +44 1223 494420
> <birney@ebi.ac.uk>.
> -----------------------------------------------------------------

-- 
========================================================================
Lincoln D. Stein                           Cold Spring Harbor Laboratory
lstein@cshl.org			                  Cold Spring Harbor, NY
========================================================================