[Bioperl-l] Bio::Location::Fuzzy, Bio::Location::Split

Hilmar Lapp lapp@gnf.org
Thu, 25 Jan 2001 13:10:23 -0800


First, I think it is better to bring this back to the list, because
users *will* be affected by the final design and implementation (i.e.,
Mark & David & others, watch out, don't complain afterwards).

Jason Stajich wrote:
> 
> So that I have really clearly solved this -
> please correct me if any of the following statements is false. (N is a
> location point)
> 
> - start/end can be fuzzy at both points and it could be <N (on 5') or N>
>   (on 3') at either start/end point.  However, N< and >N are invalid fuzzy
>   point descriptions.  If they are indeed true then my start_fuzzy will
>   need to be more than just (-1, 0, 1) -- (5', not fuzzy, 3') but 5
>   points (5' before, 5' after, 0, 3' before, 3' after) and I really don't
>   even know what that would mean since I would be so wrapped up in strand
>   coordinates - would think a 'complement' would simplify it ( no, not a
>   pat on the back, that's when we get to the release)
> 
> - in plain simple genbank/embl terms
>   <5..12> and <5.12>
>    are valid, but
>   >5..12, 5<..12, 5..12<, 5..>12
>   are invalid.

The GenBank documentation is somewhat inconsistent here. Let me quote:

From http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html#FeaturesB

<quote>
If the "<" symbol precedes a base span, the sequence is partial on the
5' end (e.g., CDS  <1..206).  If the ">" symbol follows a base span,
the sequence is partial on the 3' end (e.g., CDS   435..915>).
</quote>

From http://www.ncbi.nlm.nih.gov/collab/FT/index.html

<quote>
CDS             <1..>336
                /codon_start=1
                /gene="IGHV1"
                /product="immunoglobulin heavy chain variable region"
V_region        <1..>336
                /gene="IGHV1"
                /product="immunoglobulin heavy chain variable region"
</quote>

From the BNF grammar definition of the feature table, to be found at
http://www.ncbi.nlm.nih.gov/collab/FT/index.html#backus-naur

<quote>
local_location ::= <base_position> | <between_position> | <base_range> 
base_position ::= <integer> | <low_base_bound> | <high_base_bound> | 
<two_base_bound> 

low_base_bound ::= > <integer>

high_base_bound ::= < <integer>

two_base_bound ::= <base_position>.<base_position>

between_position ::= <base_position>^<base_position>

base_range ::= <base_position>..<base_position>
</quote>

The sample record link seems to be pretty new, but I'm not sure. Shall
we simply build upon the BNF? Maybe we should ask someone from NCBI.
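
To make the discrepancy concrete, here is a minimal sketch (Python,
not BioPerl code) of a lenient parser that accepts the marker before
the integer, as in the BNF, and also after it, as on the sample record
page. The names Position, parse_position and parse_range are made up
for illustration, and the sketch covers only plain and open-ended
positions in a base_range, not two_base_bound or between_position.

```python
import re
from typing import NamedTuple, Optional, Tuple


class Position(NamedTuple):
    value: int
    fuzzy: Optional[str]  # "<", ">" or None when the position is exact


# Assumption: tolerate the marker on either side of the integer, because
# the BNF puts it in front while the sample record page shows "435..915>".
_POSITION = re.compile(r"^([<>]?)(\d+)([<>]?)$")


def parse_position(text: str) -> Position:
    match = _POSITION.match(text.strip())
    if match is None:
        raise ValueError(f"not a base position: {text!r}")
    before, digits, after = match.groups()
    if before and after:
        raise ValueError(f"marker on both sides: {text!r}")
    return Position(int(digits), before or after or None)


def parse_range(text: str) -> Tuple[Position, Position]:
    """Split 'start..end' and parse both ends independently."""
    start, separator, end = text.partition("..")
    if not separator:
        raise ValueError(f"not a base range: {text!r}")
    return parse_position(start), parse_position(end)
```

With this, "<1..>336" and "435..915>" both parse, and the caller can
still see on which side the marker was by going back to the input
string if that matters for writing.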

> 
> Questions:
> 1. Do we need to override the famous pocock RangeI contains/overlaps
>    methods for a Split location to take into account where the pieces
>    of the contained LocationI are?
>    Or do we take the easy route and just use min_start/max_end?  I think
>    that right now start/end return 0 for a split location since they are
>    not explicitly set, should they default to delegating to
>    min_start/max_start?  I think so.
> 
>    What about in Fuzzy, do we want to throw exceptions or do we just use
>    the best information we have and do some logic and coordinate
>    gymnastics to try and return a reasonable value or else throw an
>    exception?
> 

As I understood the comments from users, exceptions should be avoided
here whenever possible. However, since there are different policies
one can think of, a mechanism should be provided to switch between
them.
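
One way such a switch might look, as a rough Python sketch rather than
a proposal for the actual BioPerl interface. The policy names (NARROW,
UNKNOWN, STRICT) and the tuple layout are assumptions made only for
this illustration; each sub-location is reduced to
(start, end, start_is_fuzzy, end_is_fuzzy).

```python
from enum import Enum
from typing import Optional, Sequence, Tuple


class FuzzyPolicy(Enum):
    # All three policy names are hypothetical.
    NARROW = "narrow"    # use the stated coordinates as they are
    UNKNOWN = "unknown"  # report None for a bound that is open-ended
    STRICT = "strict"    # raise when an outermost bound is open-ended


# (start, end, start_is_fuzzy, end_is_fuzzy)
Piece = Tuple[int, int, bool, bool]


def bounds(
    pieces: Sequence[Piece], policy: FuzzyPolicy = FuzzyPolicy.NARROW
) -> Tuple[Optional[int], Optional[int]]:
    """Return (min_start, max_end) of a split location under a policy."""
    if not pieces:
        raise ValueError("split location has no sub-locations")
    start = min(piece[0] for piece in pieces)
    end = max(piece[1] for piece in pieces)
    # A bound counts as fuzzy if any piece sitting on it is fuzzy there.
    start_fuzzy = any(piece[2] for piece in pieces if piece[0] == start)
    end_fuzzy = any(piece[3] for piece in pieces if piece[1] == end)
    if policy is FuzzyPolicy.STRICT and (start_fuzzy or end_fuzzy):
        raise ValueError("outermost bound is fuzzy")
    if policy is FuzzyPolicy.UNKNOWN:
        return (None if start_fuzzy else start, None if end_fuzzy else end)
    return start, end
```

The default here avoids exceptions, in line with the user comments;
callers who want to be told about open-ended bounds opt in to one of
the other two policies.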

> 2. Deep Split/Fuzziness - [copying famous artwork from Ewan's latest
>    email]
> 
>              LocationI
>                ^
>                |
>       ------------------------
>   SingleLocationI        SplitLocationI
>       |                      sub_Locations defined to return  SingleLocationI array
>       |
>       -----------------
>   SimpleLocationI   FuzzyLocationI
> 
> 
> (does the above crappy ascii art make sense to you?)
> 
> I guess this says that all FuzzyLocations can be made as combination of
> a single SplitLocation with a set of FuzzyLocations.
> 
> [ end Ewan's included message ]
> 
> This is exactly what I have assumed.  I see SplitLocation as simply a
> Collection of LocationI objects some of which may be fuzzy.   The only
> problem is how to define min_start/max_end for a
> SplitLocation when the beginning and end of the locations are fuzzy?
> 
> As for deep SplitLocation (ie SplitLocation containing Location objects
> that are SplitLocations), this will work in a very gross way just like
> perl flattens arrays, except I don't plan to simplify the join(...join())
> code into a single join() unless you guys think it's worth it.  It wouldn't
> be hard, just let perl collapse the arrays...
> 

Make sure that you don't lose information you need to recover the
original location entry when writing it out. If that seems to inflate
the object tree unnecessarily, we can also store the original location
string as a property. Not beautiful, but KISS is not a bad principle.
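
A minimal Python sketch of that idea, assuming hypothetical Simple and
Split classes (not the BioPerl ones): nested split locations are
flattened for coordinate queries, while the original string, if it was
stored, is what gets written back out.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class Simple:
    start: int
    end: int


@dataclass
class Split:
    parts: List[Union["Simple", "Split"]]
    # Hypothetical property: the location string exactly as it was read.
    source: Optional[str] = None

    def flat(self) -> List[Simple]:
        """Collapse nested split locations into one list, in order."""
        out: List[Simple] = []
        for part in self.parts:
            if isinstance(part, Split):
                out.extend(part.flat())
            else:
                out.append(part)
        return out

    def to_string(self) -> str:
        # Prefer the stored original so nesting survives a round trip.
        if self.source is not None:
            return self.source
        spans = ",".join(f"{p.start}..{p.end}" for p in self.flat())
        return f"join({spans})"
```

Without the stored string, a nested join comes back as a single flat
join(), which is exactly the loss of information to watch out for.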

> Any other problems you guys can think of?
> 
> So close... I wonder if we should include Alan on this so we can see if
> the biocorba IDL will really handle all of this now?  I guess I could 

To my understanding BioCorba and BioPerl pretty much affect each
other, don't they? If so, we should definitely get a comment from him.

	Hilmar
-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp@gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------