[DAS2] feature group assembly; proposals for simplification
Andrew Dalke
dalke at dalkescientific.com
Tue Sep 12 01:44:09 UTC 2006
I've been working on a writeback server. It will verify that the
feature groups have no cycles, that if X is a parent to Y then Y
is a part of X, and that all groups has a single root.
I'm having a hard time with that, and harder than I expected.
GFF3 had only one direction of relationship. As such it's impossible
to assemble a feature group until the end of the file or the
marker that no lookahead is needed.
We changed that in DAS2. The feature xml is bidirectional so
in theory it's possible to know when the feature group is complete.
But it's tricky. It's tricky enough that I want to change things
slightly so that people don't need to handle the trickiness.
The trickiness comes when parsing the list of features into
feature groups. For example, consider
[F3]
/ \
[F4] [F2]
| |
[F1] [F5]
where the features are in the order F1, F2, ... F5. After F2
the system looks like there are two feature groups.
[F3?]
|
[F4?] [F2]
| |
[F1] [F5?]
Only after F3 can those be merged together. This requires some
non-trivial bookkeeping, unless I've forgotten something simple
from undergraduate data structures.
Of course it's simple if you know that a feature is the last feature
in a feature group either through reaching EOF or a special marker.
But then what's the point of having bidirectional links if the
result is no better than GFF3's only-list-parent solution.
If there is a simple algorithm, please let me know.
=== Solution #1 ===
Another solution is to require that complex feature groups (groups
with more than one feature) must also have a link to the root element
of the feature group. I bought this up before but agreed with others
that that there wasn't a need for it. Now I think there is.
Here's an example.
<FEATURE uri="blah" root="uri/for/root/feature" />
<PARENT ... />
<PART ... />
</FEATURE>
By using a 'root' attribute, detecting the end of a feature group is
almost trivial:
a FeatureGroup contains:
- list of seen urls # duplicates are not allowed
- set of urls which must be seen # duplicates are ignored
let feature_groups := mapping {root uri -> FeatureGroup }
for feature in features:
if feature does not have a @root attribute:
make a new FeatureGroup
add the feature to the FeatureGroup as being seen
let feature_groups[feature's @uri attribute] := the new
FeatureGroup
else:
if the features's @root attribute does not exist in
features_group:
# first time this feature group was seen
create a new FeatureGroup
let feature_groups[feature's @root attribute] := the new
FeatureGroup
get feature_group[feature's @root attribute]
add this feature to the FeatureGroup as a seen url
for each uri in (feature's @uri attribute, the parent uris, the
part uris):
add the uri to the FeatureGroup's "must be seen" set
if count(seen urls) == count(must be seen urls):
the feature group is complete / assemble the links
Assembly of a feature group occurs as soon as all the features are
available,
rather than waiting for the end.
This makes life much simpler for the writeback, and I assume also for
the client code. Assuming the client code doesn't just wait until the
end of the input before it does anything.
Gregg? Do you wait until the end of the XML to assemble hierarchical
features?
If so, do you need parent/part or will parent suffice? Or do you do
all the
bookkeeping as you get the data? How complex is the code?
There are other solutions:
=== Solution #2 ===
- require that the features are ordered so that parents comes before a
part
I think this is hard because it relational databases aren't naturally
ordered.
The normal trick is to put an extra field and "sort by", but then the
server
has to maintain the correct ordering. It's fragile.
=== Solution #3 ===
- put all <FEATURE> elements for a given feature group explicitly
inside of a <FEATURE_GROUP> element.
Eg,
<FEATURES>
<FEATURE_GROUP>
<FEATURE .. schema is unchanged />
<FEATURE .. schema is unchanged />
<FEATURE .. schema is unchanged />
</FEATURE_GROUP>
<FEATURE .. this is a simple feature; structure unchanged from usual
/>
(in this case simple features not part of a complex parent/part
relationship
need not be in a FEATURE_GROUP.)
This is the easiest solution. I like it because it's the easiest to
figure
out. Even the algorithm above is hard by comparison.
If I had my choice we would do this instead of determining the feature
group
by analysis of the parent/part linkages.
Note that with this change there's no longer need for the PART element.
We would only need PARENT.
Andrew
dalke at dalkescientific.com
More information about the DAS2
mailing list