[DAS2] feature group assembly; proposals for simplification

Andrew Dalke dalke at dalkescientific.com
Tue Sep 12 01:44:09 UTC 2006


I've been working on a writeback server.  It will verify that the
feature groups have no cycles, that if X is a parent to Y then Y
is a part of X, and that all groups has a single root.

I'm having a hard time with that, and harder than I expected.

GFF3 had only one direction of relationship.  As such it's impossible
to assemble a feature group until the end of the file or the
marker that no lookahead is needed.

We changed that in DAS2.  The feature xml is bidirectional so
in theory it's possible to know when the feature group is complete.
But it's tricky.  It's tricky enough that I want to change things
slightly so that people don't need to handle the trickiness.

The trickiness comes when parsing the list of features into
feature groups.  For example, consider

        [F3]
        /  \
    [F4]   [F2]
      |      |
    [F1]   [F5]

where the features are in the order F1, F2, ... F5.  After F2
the system looks like there are two feature groups.

           [F3?]
             |
    [F4?]  [F2]
      |      |
    [F1]   [F5?]

Only after F3 can those be merged together.  This requires some
non-trivial bookkeeping, unless I've forgotten something simple
from undergraduate data structures.

Of course it's simple if you know that a feature is the last feature
in a feature group either through reaching EOF or a special marker.
But then what's the point of having bidirectional links if the
result is no better than GFF3's only-list-parent solution.

If there is a simple algorithm, please let me know.

    === Solution #1 ===

Another solution is to require that complex feature groups (groups
with more than one feature) must also have a link to the root element
of the feature group.  I bought this up before but agreed with others
that that there wasn't a need for it.  Now I think there is.

Here's an example.

   <FEATURE uri="blah" root="uri/for/root/feature" />
     <PARENT ... />
     <PART ... />
   </FEATURE>

By using a 'root' attribute, detecting the end of a feature group is 
almost trivial:

   a FeatureGroup contains:
      - list of seen urls    # duplicates are not allowed
      - set of urls which must be seen  # duplicates are ignored

   let feature_groups :=  mapping {root uri -> FeatureGroup }

   for feature in features:
     if feature does not have a @root attribute:
        make a new FeatureGroup
        add the feature to the FeatureGroup as being seen
        let feature_groups[feature's @uri attribute] := the new 
FeatureGroup

     else:
        if the features's @root attribute does not exist in 
features_group:
           # first time this feature group was seen
           create a new FeatureGroup
           let feature_groups[feature's @root attribute] := the new 
FeatureGroup

        get feature_group[feature's @root attribute]
        add this feature to the FeatureGroup as a seen url
        for each uri in (feature's @uri attribute, the parent uris, the 
part uris):
             add the uri to the FeatureGroup's "must be seen" set
        if count(seen urls) == count(must be seen urls):
             the feature group is complete / assemble the links

Assembly of a feature group occurs as soon as all the features are 
available,
rather than waiting for the end.

This makes life much simpler for the writeback, and I assume also for
the client code.  Assuming the client code doesn't just wait until the
end of the input before it does anything.

Gregg?  Do you wait until the end of the XML to assemble hierarchical 
features?
If so, do you need parent/part or will parent suffice?  Or do you do 
all the
bookkeeping as you get the data?  How complex is the code?

There are other solutions:

   === Solution #2 ===
  - require that the features are ordered so that parents comes before a 
part

I think this is hard because it relational databases aren't naturally 
ordered.
The normal trick is to put an extra field and "sort by", but then the 
server
has to maintain the correct ordering.  It's fragile.

   === Solution #3 ===

   - put all <FEATURE> elements for a given feature group explicitly
inside of a <FEATURE_GROUP> element.

Eg,

<FEATURES>
   <FEATURE_GROUP>
     <FEATURE .. schema is unchanged />
     <FEATURE .. schema is unchanged />
     <FEATURE .. schema is unchanged />
   </FEATURE_GROUP>
   <FEATURE .. this is a simple feature; structure unchanged from usual 
/>

(in this case simple features not part of a complex parent/part 
relationship
need not be in a FEATURE_GROUP.)

This is the easiest solution.  I like it because it's the easiest to 
figure
out.  Even the algorithm above is hard by comparison.

If I had my choice we would do this instead of determining the feature 
group
by analysis of the parent/part linkages.

Note that with this change there's no longer need for the PART element.
We would only need PARENT.

					Andrew
					dalke at dalkescientific.com




More information about the DAS2 mailing list