[DAS2] feature locations

Fri Aug 18 23:33:44 UTC 2006

I think all of us this morning, except you, want:

2) Yes, parent region must encompass all child regions
3) Yes, a single segment that encompasses all child regions
4) In your example:
  overlaps(30,40) returns the whole parent and child
  inside(30,40) returns neither the parent nor the child

The user (client) is responsible for asking for things that make sense.
For mRNA transcripts and exons, an overlaps query is sensible.
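
A minimal sketch of how answers 2) through 4) fit together, in Python.
The Feature class and the family/overlaps/inside helpers are
hypothetical names for illustration only, not part of the DAS/2 spec.
It assumes the parent X carries the single enclosing location (10, 60),
that ranges are half-open, and that a hit on any member of a
parent/child family returns the whole family.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Feature:
    name: str
    location: Tuple[int, int]  # (start, end), assumed half-open
    children: List["Feature"] = field(default_factory=list)


def family(feature):
    """Return the feature followed by all of its descendants."""
    members = [feature]
    for child in feature.children:
        members.extend(family(child))
    return members


def overlaps(root, start, end):
    """Names of the whole family if any member overlaps [start, end)."""
    members = family(root)
    hit = any(m.location[0] < end and start < m.location[1] for m in members)
    return [m.name for m in members] if hit else []


def inside(root, start, end):
    """Names of the whole family if any member lies entirely in [start, end)."""
    members = family(root)
    hit = any(start <= m.location[0] and m.location[1] <= end for m in members)
    return [m.name for m in members] if hit else []


# The example from the original message, with X given the location (10, 60).
X = Feature("X", (10, 60), [Feature("Y", (10, 20)), Feature("Z", (50, 60))])

print(overlaps(X, 30, 40))  # X spans the query range, so the family comes back
print(inside(X, 30, 40))    # nothing lies entirely inside the query range
```

Under these assumptions the overlaps query matches only because the
parent location bridges the gap between Y and Z.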

Here is my two cents about the "catalytic site" you talk about....

I agree that a "catalytic site" such as you describe requires some
thought.  But it requires thought from the curator on how to describe
it, not smartness of the DAS server itself.  If the catalytic site is
composed of parts of exons on a single mRNA, they should maybe be put
into a parent-child relationship.  If different components of the
catalytic site are on different mRNAs that fold up and combine into a
complex compound (like hemoglobin) then the parts that are on different
mRNAs probably should be treated as different features.  Or even more
simply, there could be a feature type "catalytic site component" that
can be a "part of" an exon.

Anyway, that is *my* opinion.  #2 Yes, #3 Yes, and #4 the annotator is
responsible for being smart.

I can at least see now why you think there might be a problem, but I
don't agree that it is a problem.

-----Original Message-----
From: das2-bounces at lists.open-bio.org
[mailto:das2-bounces at lists.open-bio.org] On Behalf Of Andrew Dalke
Sent: Friday, August 18, 2006 2:56 PM
To: DAS/2
Subject: [DAS2] feature locations

[ I hope to hear a response before the end of the sprint today. ]

For those not in the phone conference call today there were several
issues which didn't get resolved regarding feature locations:

   1) do we need multiple locations on a feature?  (vs 0 or 1 location)
         (I argue this is mostly a data modeling issue as I can
          decompose anything to a set of features with at most 1
          location.)

   2) if a child has a location, is its parent required to have
        locations which include the child locations? (currently no)

   3) if #2, is the parent required to have a single location per
       segment? i.e., if there are children on a given segment
       then the parent must have a single location on that segment where
              start_location <= min(children.start_location)
              end_location >= max(children.end_location)

   4) how is the feature search done?
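
The constraint in item 3 can be written down as a small check. This is
only a sketch under stated assumptions: locations are modeled as
(segment, start, end) tuples, and the function name and the segment
name "chr1" are made up for the example.

```python
from collections import defaultdict


def satisfies_item_3(parent_locs, child_locs):
    """Check item 3: on every segment that holds children, the parent has
    exactly one location, and that location encloses all child locations.

    Locations are (segment, start, end) tuples (an assumed representation).
    """
    parent_by_segment = defaultdict(list)
    for segment, start, end in parent_locs:
        parent_by_segment[segment].append((start, end))

    children_by_segment = defaultdict(list)
    for segment, start, end in child_locs:
        children_by_segment[segment].append((start, end))

    for segment, spans in children_by_segment.items():
        parent_spans = parent_by_segment.get(segment, [])
        if len(parent_spans) != 1:
            return False  # zero or several parent locations on this segment
        p_start, p_end = parent_spans[0]
        if p_start > min(s for s, _ in spans):
            return False  # parent starts after the first child
        if p_end < max(e for _, e in spans):
            return False  # parent ends before the last child
    return True


child_locs = [("chr1", 10, 20), ("chr1", 50, 60)]
print(satisfies_item_3([("chr1", 10, 60)], child_locs))
print(satisfies_item_3([("chr1", 10, 20), ("chr1", 50, 60)], child_locs))
```

The second call fails the check because the parent has two locations on
the same segment, even though together they cover every child.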

Here's what I think is the problem question.

    Feature X is the parent of Y and Z with
       Y.location = (10,20) and Z.location = (50, 60)

    What do you get from an overlap(30, 40) search?

In the way I've been thinking about it, this returns nothing.  None
of the features have locations which overlap that range.
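
A minimal sketch of that reading, in Python. The dictionary layout and
the overlap() helper are assumptions made for illustration (half-open
ranges, a single segment); X is modeled with no location of its own.

```python
# Each feature maps to its list of (start, end) locations.
features = {
    "X": [],            # parent, no location of its own
    "Y": [(10, 20)],
    "Z": [(50, 60)],
}


def overlap(features, start, end):
    """Names of features with at least one location overlapping [start, end)."""
    return sorted(
        name
        for name, locations in features.items()
        if any(s < end and start < e for s, e in locations)
    )


print(overlap(features, 30, 40))  # prints []
print(overlap(features, 15, 55))  # prints ['Y', 'Z']
```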

I gather that others want this to return {X,Y,Z} and do so
because X should be assigned the location (10, 60).  X cannot
be location-less.

I don't know enough DNA to give an example of something for
which a location makes no sense.  I think in proteins.  Consider
X = "catalytic site" with Y and Z denoting regions essential
to catalysis.

The section between Y and Z has nothing to do with "catalytic
site".  Automatically including that range in X makes no sense.
For that matter, Y and Z may be on different segments.

Hence I don't like #3.  It doesn't make sense for some data types.
(Now it may be that certain data types must work this way.  But
that's up to users of features of that type.  A database could
enforce those cases but a dumb database shouldn't be required to
know all types.)

Without the extra qualification of #3 then here's a dead simple
way to implement #2 -

   parent_locations = { all of its children locations }

Hence in my test case:
   Y has 1 location (10, 20)
   Z has 1 location (50, 60)
   ---> X has two locations (10, 20) and (50, 60)

That perfectly agrees with #2.  But only because we support
multiple locations.  We need multiple locations because
we have features which span multiple segments.  Hence the
additional per-segment restriction that #3 requires.

If #2 is in place then I'll argue that a client should
only put in the union of the regions because unless it
knows the type it doesn't know if the min/max single
location makes sense.
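
To make the difference concrete, here is a sketch comparing the union
of child locations with the single min/max location, using the test
case above. All function names are hypothetical, and it assumes one
segment and half-open (start, end) ranges.

```python
def union_locations(children):
    """Parent locations as the set of all child locations (satisfies #2)."""
    locations = set()
    for child_locations in children.values():
        locations.update(child_locations)
    return sorted(locations)


def bounding_location(locations):
    """Single min/max location enclosing all given locations (as in #3)."""
    return (min(s for s, _ in locations), max(e for _, e in locations))


def any_overlap(locations, start, end):
    """True if any location overlaps [start, end)."""
    return any(s < end and start < e for s, e in locations)


children = {"Y": [(10, 20)], "Z": [(50, 60)]}

x_union = union_locations(children)    # [(10, 20), (50, 60)]
x_bound = bounding_location(x_union)   # (10, 60)

print(any_overlap(x_union, 30, 40))    # prints False
print(any_overlap([x_bound], 30, 40))  # prints True
```

With the union, X is found only where Y or Z actually sit; with the
min/max location, X is also found in the gap between them.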

Please let me know if I'm on the right track before going
onwards with search.

					Andrew
					dalke at dalkescientific.com
