[DAS2] feature locations

Andrew Dalke dalke at dalkescientific.com
Fri Aug 18 21:55:55 UTC 2006


[ I hope to hear a response before the end of the sprint today. ]

For those who were not on the conference call today, several issues
regarding feature locations didn't get resolved:

   1) do we need multiple locations on a feature?  (vs 0 or 1 location)
         (I argue this is mostly a data modeling issue as I can
          decompose anything to a set of features with at most 1
          location.)

   2) if a child has a location, is its parent required to have
        locations which include the child locations? (currently no)

   3) if #2, is the parent required to have a single location per
       segment? i.e., if there are children on a given segment
       then the parent must have a single location on that segment where
              start_location <= min(children.start_location)
              end_location >= max(children.end_location)

   4) how is the feature search done?
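
To pin down what #3 would demand, here is a minimal sketch of a
check for it.  The (segment, start, end) tuple layout, the function
name satisfies_rule_3 and the segment name "chr1" are my own
assumptions for illustration, not anything from the DAS2 spec.

```python
from collections import defaultdict

def satisfies_rule_3(parent_locations, child_locations):
    """Check proposal #3 for one parent (hypothetical helper).

    Locations are (segment, start, end) tuples.  For every segment
    that holds a child location, the parent must have exactly one
    location on that segment, and it must cover the children there.
    """
    child_spans = defaultdict(list)
    for segment, start, end in child_locations:
        child_spans[segment].append((start, end))

    for segment, spans in child_spans.items():
        on_segment = [(s, e) for seg, s, e in parent_locations if seg == segment]
        if len(on_segment) != 1:
            return False  # zero or several parent locations on this segment
        p_start, p_end = on_segment[0]
        if p_start > min(s for s, _ in spans):
            return False  # parent starts after its earliest child
        if p_end < max(e for _, e in spans):
            return False  # parent ends before its latest child
    return True

children = [("chr1", 10, 20), ("chr1", 50, 60)]
single_span = satisfies_rule_3([("chr1", 10, 60)], children)
two_pieces = satisfies_rule_3([("chr1", 10, 20), ("chr1", 50, 60)], children)
no_location = satisfies_rule_3([], children)
print(single_span, two_pieces, no_location)  # True False False
```

Note that a parent carrying exactly its children's locations fails
this check, which is the difference between #2 and #3.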

Here's what I think is the key question.

    Feature X is the parent of Y and Z with
       Y.location = (10,20) and Z.location = (50, 60)

    What do you get from an overlap(30, 40) search?

In the way I've been thinking about it, this returns nothing.  None
of the features have locations which overlap that range.

I gather that others want this to return {X,Y,Z}, on the grounds
that X should be assigned the location (10, 60).  In that view X
cannot be location-less.
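
The two readings can be compared with a small sketch.  It rests on
assumptions of mine: the Feature class and the helper names are
made up, coordinates are treated as inclusive, and a hit on a
feature is taken to bring its descendants along (which is how I
read the {X,Y,Z} answer).

```python
from dataclasses import dataclass, field

@dataclass
class Feature:
    name: str
    locations: list = field(default_factory=list)  # (start, end), assumed inclusive
    children: list = field(default_factory=list)

def overlaps(location, start, end):
    loc_start, loc_end = location
    return loc_start <= end and start <= loc_end

def family(feature):
    """The feature itself plus all of its descendants."""
    found = [feature]
    for child in feature.children:
        found.extend(family(child))
    return found

def overlap_search(roots, start, end):
    """Names of features whose own locations overlap, plus their descendants."""
    hits = set()
    for root in roots:
        for feature in family(root):
            if any(overlaps(loc, start, end) for loc in feature.locations):
                hits.update(member.name for member in family(feature))
    return hits

def build(x_locations):
    y = Feature("Y", [(10, 20)])
    z = Feature("Z", [(50, 60)])
    return Feature("X", x_locations, [y, z])

location_less = overlap_search([build([])], 30, 40)
bounding_span = overlap_search([build([(10, 60)])], 30, 40)
print(sorted(location_less))  # []
print(sorted(bounding_span))  # ['X', 'Y', 'Z']
```

The only thing that changes between the two runs is whether X
carries the (10, 60) location.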


I don't know enough about DNA to give an example of something for
which such a location makes no sense, so I'll think in terms of
proteins.  Consider X = "catalytic site" with Y and Z denoting
regions essential to catalysis.

The section between Y and Z has nothing to do with "catalytic
site".  Automatically including that range in X makes no sense.
For that matter, Y and Z may be on different segments.

Hence I don't like #3.  It doesn't make sense for some data types.
(Now it may be that certain data types must work this way.  But
that's up to users of features of that type.  A database could
enforce those cases but a dumb database shouldn't be required to
know all types.)


Without the extra qualification of #3, here's a dead simple
way to implement #2:

   parent_locations = { all of its children locations }

Hence in my test case:
   Y has 1 location (10, 20)
   Z has 1 location (50, 60)
   ---> X has two locations (10, 20) and (50, 60)
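
As a rough sketch of that rule, assuming locations are plain
(start, end) pairs with inclusive ends (the function names here
are hypothetical):

```python
def parent_locations(children_locations):
    """Collect every distinct child location, in order; nothing is merged."""
    collected = []
    for locations in children_locations:
        for location in locations:
            if location not in collected:
                collected.append(location)
    return collected

def overlaps_any(locations, start, end):
    # Inclusive-coordinate overlap test (an assumption of this sketch).
    return any(s <= end and start <= e for s, e in locations)

y_locations = [(10, 20)]
z_locations = [(50, 60)]
x_locations = parent_locations([y_locations, z_locations])

print(x_locations)                        # [(10, 20), (50, 60)]
print(overlaps_any(x_locations, 30, 40))  # False
print(overlaps_any(x_locations, 15, 40))  # True
```

With X built this way, a search over (30, 40) still misses X,
since the gap between Y and Z is never part of X.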

That perfectly agrees with #2, but only because we support
multiple locations.  We need multiple locations anyway because
we have features which span multiple segments.  Hence getting #3
requires the additional restriction of a single location per segment.

If #2 is in place then I'll argue that a client should
only put in the union of the regions because, unless it
knows the type, it doesn't know if the single min/max
location makes sense.


Please let me know if I'm on the right track before going
onwards with search.

					Andrew
					dalke at dalkescientific.com



