[DAS2] feature locations

Lincoln Stein lstein at cshl.edu
Mon Aug 21 19:17:14 UTC 2006


From: "Lincoln Stein" <lincoln.stein at gmail.com>
To: "Andrew Dalke" <dalke at dalkescientific.com>
Date: Mon, 21 Aug 2006 15:04:09 -0400
Subject: Re: [DAS2] feature locations

On 8/18/06, Andrew Dalke <dalke at dalkescientific.com> wrote:
>
> [ I hope to hear a response before the end of the sprint today. ]
>
> For those not in the phone conference call today there were several
> issues which didn't get resolved regarding feature locations:
>
>    1) do we need multiple locations on a feature?  (vs 0 or 1 location)
>          (I argue this is mostly a data modeling issue as I can
>           decompose anything to a set of features with at most 1
>           location.)



Yes, because a feature may be discontinuous. This capability won't be used
very often, however, and simple servers might simply refuse to handle such
features.


>    2) if a child has a location, is its parent required to have
>         locations which include the child locations? (currently no)



No. Parent/child relationships are defined by functional/biological
relationships and not by genomic coordinates. For example, a C. elegans
transcript is assembled from discontinuous regions of the genome (the mRNA
on one chromosome, the spliced leader on another), and enforcing
restriction (2) would make it impossible to represent the genomes of
nematodes, the most populous multicellular organisms on earth.


>    3) if #2, is the parent required to have a single location per
>        segment? i.e., if there are children on a given segment
>        then the parent must have a single location on that segment where
>               start_location <= min(children.start_location)
>               end_location >= max(children.end_location)



N/A

>    4) how is the feature search done?


A feature may have multiple locations. If any of its locations matches the
range query, then the feature, plus its parents and children, is returned.
There is no "transitive" matching. That is, if the query consists of a
feature type plus a range, then IT IS NOT appropriate to return a feature if
its child matches the range and the feature itself matches the type. The
query should only return a feature if both the feature's type and location
match.
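
The search rule above could be sketched roughly as follows. This is only an
illustration, not DAS2 server code: the Feature class, its field names, the
closed-interval overlap test, and the sample features are all assumptions
made for the example, and "parents and children" is read here as the direct
ones only.

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    # Hypothetical record layout, not taken from the DAS2 spec.
    uri: str
    ftype: str
    locations: list = field(default_factory=list)  # (segment, start, end)
    parents: list = field(default_factory=list)    # uris of direct parents
    children: list = field(default_factory=list)   # uris of direct children


def overlaps(location, segment, start, end):
    # Assumes closed intervals on the same segment.
    seg, loc_start, loc_end = location
    return seg == segment and loc_start <= end and loc_end >= start


def search(features, segment, start, end, ftype=None):
    """Return uris of matching features plus their direct parents/children.

    A feature matches only if its own type (when given) and at least one
    of its own locations match; a child's location never stands in for
    the parent's (no "transitive" matching).
    """
    found = set()
    for f in features:
        if ftype is not None and f.ftype != ftype:
            continue
        if any(overlaps(loc, segment, start, end) for loc in f.locations):
            found.add(f.uri)
            found.update(f.parents)
            found.update(f.children)
    return sorted(found)


# Made-up data: a location-less parent with two located children.
FEATURES = [
    Feature("transcript1", "transcript", children=["exon1", "exon2"]),
    Feature("exon1", "exon", [("chrI", 10, 20)], parents=["transcript1"]),
    Feature("exon2", "exon", [("chrI", 50, 60)], parents=["transcript1"]),
]

print(search(FEATURES, "chrI", 15, 18))
print(search(FEATURES, "chrI", 30, 40))
print(search(FEATURES, "chrI", 15, 18, ftype="transcript"))
```

In this sketch the first query returns exon1 together with its parent, the
second returns nothing, and the third also returns nothing because the
transcript matches the type but has no location of its own in the range.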

Lincoln

> Here's what I think is the problem question.
>
>     Feature X is the parent of Y and Z with
>        Y.location = (10,20) and Z.location = (50, 60)
>
>     What do you get from an overlap(30, 40) search?
>
> In the way I've been thinking about it, this returns nothing.  None
> of the features have locations which overlap that range.
>
> I gather that others want this to return {X,Y,Z} and do so
> because X should be assigned the location (10, 60).  X cannot
> be location-less.
>
>
> I don't know enough DNA to give an example of something for
> which a location makes no sense.  I think in proteins.  Consider
> X = "catalytic site" with Y and Z denoting regions essential
> to catalysis.
>
> The section between Y and Z has nothing to do with "catalytic
> site".  Automatically including that range in X makes no sense.
> For that matter, Y and Z may be on different segments.
>
> Hence I don't like #3.  It doesn't make sense for some data types.
> (Now it may be that certain data types must work this way.  But
> that's up to users of features of that type.  A database could
> enforce those cases but a dumb database shouldn't be required to
> know all types.)
>
>
> Without the extra qualification of #3 then here's a dead simple
> way to implement #2 -
>
>    parent_locations = { all of its children locations }
>
> Hence in my test case:
>    Y has 1 location (10, 20)
>    Z has 1 location (50, 60)
>    ---> X has two locations (10, 20) and (50, 60)
>
> That perfectly agrees with #2.  But only because we support
> multiple locations.  We need multiple locations because
> we have features which span multiple segments.  Hence the
> additional restriction required to make #3.
>
> If #2 is in place then I'll argue that a client should
> only put in the union of the regions because unless it
> knows the type it doesn't know if the min/max single
> location make sense.
>
>
> Please let me know if I'm on the right track before going
> onwards with search.
>
>                                         Andrew
>                                         dalke at dalkescientific.com
>
> _______________________________________________
> DAS2 mailing list
> DAS2 at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/das2
>
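
The two ways of deriving a parent's locations discussed in the quoted
message (the union of the children's locations versus one min/max span per
segment) can be compared with a small sketch. The function names and the
segment name "seg1" are made up for illustration; the coordinates are the
ones from the Y/Z test case, and intervals are assumed to be closed.

```python
def union_of_children(child_locations):
    # Parent gets exactly the locations its children have.
    return sorted(set(child_locations))


def bounding_spans(child_locations):
    # Parent gets one location per segment, running from the smallest
    # child start to the largest child end (the rule in question 3).
    spans = {}
    for seg, start, end in child_locations:
        lo, hi = spans.get(seg, (start, end))
        spans[seg] = (min(lo, start), max(hi, end))
    return sorted((seg, lo, hi) for seg, (lo, hi) in spans.items())


def any_overlap(locations, segment, start, end):
    # Closed-interval overlap test against any of the given locations.
    return any(seg == segment and s <= end and e >= start
               for seg, s, e in locations)


children = [("seg1", 10, 20), ("seg1", 50, 60)]  # Y and Z

print(union_of_children(children))
print(bounding_spans(children))
print(any_overlap(union_of_children(children), "seg1", 30, 40))
print(any_overlap(bounding_spans(children), "seg1", 30, 40))
```

Under the union rule an overlap(30, 40) query does not hit the parent;
under the min/max rule it does, which is the disagreement described above.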



--
Lincoln D. Stein
Cold Spring Harbor Laboratory
1 Bungtown Road
Cold Spring Harbor, NY 11724
(516) 367-8380 (voice)
(516) 367-8389 (fax)
FOR URGENT MESSAGES & SCHEDULING,
PLEASE CONTACT MY ASSISTANT,
SANDRA MICHELSEN, AT michelse at cshl.edu




