[DAS2] feature locations

Fri Aug 18 21:55:55 UTC 2006

[ I hope to hear a response before the end of the sprint today. ]

For those who weren't on the conference call today, several issues
regarding feature locations didn't get resolved:

   1) do we need multiple locations on a feature?  (vs 0 or 1 location)
         (I argue this is mostly a data modeling issue as I can
          decompose anything to a set of features with at most 1
          location.)

   2) if a child has a location is its parent required to have
        locations which includes the child locations? (currently no)

   3) if #2, is the parent required to have a single location per
       segment? i.e., if there are children on a given segment
       then the parent must have a single location on that segment where
              start_location <= min(children.start_location)
              end_location >= max(children.end_location)

   4) how is the feature search done?
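
To make the constraint in #3 concrete, here is a rough sketch of a
check for it.  The (segment, start, end) tuple layout, the function
name and the "chr1" segment name are my own assumptions for
illustration, not anything in the spec.

```python
from collections import defaultdict


def satisfies_rule_3(parent_locations, child_locations):
    """Check rule #3 for one parent feature.

    Each location is a hypothetical (segment, start, end) tuple.
    On every segment that has children, the parent must have exactly
    one location, and it must span all the children on that segment.
    """
    parent_by_segment = defaultdict(list)
    for segment, start, end in parent_locations:
        parent_by_segment[segment].append((start, end))

    children_by_segment = defaultdict(list)
    for segment, start, end in child_locations:
        children_by_segment[segment].append((start, end))

    for segment, children in children_by_segment.items():
        spans = parent_by_segment.get(segment, [])
        if len(spans) != 1:
            return False  # zero or several parent locations on this segment
        start, end = spans[0]
        if start > min(s for s, _ in children):
            return False  # parent starts after its first child
        if end < max(e for _, e in children):
            return False  # parent ends before its last child
    return True


children = [("chr1", 10, 20), ("chr1", 50, 60)]
print(satisfies_rule_3([("chr1", 10, 60)], children))                    # True
print(satisfies_rule_3([("chr1", 10, 20), ("chr1", 50, 60)], children))  # False
```

Note that a parent which simply lists its children's locations
satisfies #2 but fails this check.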

Here's what I think is the key question.

    Feature X is the parent of Y and Z with
       Y.location = (10,20) and Z.location = (50, 60)

    What do you get from an overlap(30, 40) search?

In the way I've been thinking about it, this returns nothing.  None
of the features have locations which overlap that range.

I gather that others want this to return {X,Y,Z}, on the grounds
that X should be assigned the location (10, 60).  X cannot
be location-less.
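
The two readings can be compared with a small sketch.  The
dictionary layout and function names are made up for illustration,
and I'm assuming closed intervals; the spec may define overlap
differently.

```python
def overlaps(location, start, end):
    """True if the closed interval `location` intersects (start, end)."""
    loc_start, loc_end = location
    return loc_start <= end and start <= loc_end


def overlap_search(locations, start, end):
    """Return the names of features with at least one overlapping location."""
    return {name for name, locs in locations.items()
            if any(overlaps(loc, start, end) for loc in locs)}


# Reading 1: X has no location of its own.
no_parent_location = {"X": [], "Y": [(10, 20)], "Z": [(50, 60)]}

# Reading 2: X is assigned the min/max span of its children.
spanning_parent = {"X": [(10, 60)], "Y": [(10, 20)], "Z": [(50, 60)]}

print(overlap_search(no_parent_location, 30, 40))  # set()
print(overlap_search(spanning_parent, 30, 40))     # {'X'}
```

Under the second reading only X matches directly.  Getting {X,Y,Z}
back presumably relies on the server also returning the children of
any matching parent, which this sketch does not do.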

I don't know enough about DNA to give an example of something for
which a location makes no sense.  I think in terms of proteins.  Consider
X = "catalytic site" with Y and Z denoting regions essential
to catalysis.

The section between Y and Z has nothing to do with "catalytic
site".  Automatically including that range in X makes no sense.
For that matter, Y and Z may be on different segments.

Hence I don't like #3.  It doesn't make sense for some data types.
(Now it may be that certain data types must work this way.  But
that's up to users of features of that type.  A database could
enforce those cases but a dumb database shouldn't be required to
know all types.)

Without the extra qualification of #3 then here's a dead simple
way to implement #2 -

   parent_locations = { all of its children locations }

Hence in my test case:
   Y has 1 location (10, 20)
   Z has 1 location (50, 60)
   ---> X has two locations (10, 20) and (50, 60)
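
As a sketch only, assuming locations are plain (start, end) tuples
and using a helper name I made up, the union could be built like
this:

```python
def parent_locations(children_locations):
    """Collect every child location once, keeping first-seen order.

    `children_locations` is a list with one list of (start, end)
    tuples per child feature.
    """
    seen = set()
    result = []
    for locs in children_locations:
        for loc in locs:
            if loc not in seen:
                seen.add(loc)
                result.append(loc)
    return result


y_locations = [(10, 20)]
z_locations = [(50, 60)]
x_locations = parent_locations([y_locations, z_locations])
print(x_locations)  # [(10, 20), (50, 60)]
```

Neither of X's two locations touches (30, 40), so an overlap(30, 40)
search still returns nothing.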

That perfectly agrees with #2, but only because we support
multiple locations.  We need multiple locations because
we have features which span multiple segments.  Hence #3
needs an additional restriction on top of #2.

If #2 is in place then I'll argue that a client should
only put in the union of the regions, because unless it
knows the type it doesn't know whether the single min/max
location makes sense.

Please let me know if I'm on the right track before going
onwards with search.

					Andrew
					dalke at dalkescientific.com