[DAS2] feature group assembly; proposals for simplification

Tue Sep 19 17:06:24 UTC 2006

I wrote a writeback parser for the current XML style, and although I did not
add code to reject multi-rooted groups (which may not be appropriate
anyway), didn't find the book-keeping to be particularly onerous.

If I understand clearly, the complaint isn't the book-keeping itself, but
rather the memory requirements imposed by the book-keeping.  Why not just
give HTTP 413 (request entity too large) if you don't like the size of the
file being uploaded?  Gregg and I had a discussion about likely writeback
document sizes during the last code sprint, and Genoviz is likely to be
giving documents in the 1-50KB range -- nowhere near 80MB of GFF3 worth of
features.

-Allen

On 9/11/06, Andrew Dalke <dalke at dalkescientific.com> wrote:
>
> I've been working on a writeback server.  It will verify that the
> feature groups have no cycles, that if X is a parent to Y then Y
> is a part of X, and that all groups has a single root.
>
> I'm having a hard time with that, and harder than I expected.
>
> GFF3 had only one direction of relationship.  As such it's impossible
> to assemble a feature group until the end of the file or the
> marker that no lookahead is needed.
>
> We changed that in DAS2.  The feature xml is bidirectional so
> in theory it's possible to know when the feature group is complete.
> But it's tricky.  It's tricky enough that I want to change things
> slightly so that people don't need to handle the trickiness.
>
> The trickiness comes when parsing the list of features into
> feature groups.  For example, consider
>
>         [F3]
>         /  \
>     [F4]   [F2]
>       |      |
>     [F1]   [F5]
>
> where the features are in the order F1, F2, ... F5.  After F2
> the system looks like there are two feature groups.
>
>            [F3?]
>              |
>     [F4?]  [F2]
>       |      |
>     [F1]   [F5?]
>
> Only after F3 can those be merged together.  This requires some
> non-trivial bookkeeping, unless I've forgotten something simple
> from undergraduate data structures.
>
> Of course it's simple if you know that a feature is the last feature
> in a feature group either through reaching EOF or a special marker.
> But then what's the point of having bidirectional links if the
> result is no better than GFF3's only-list-parent solution.
>
> If there is a simple algorithm, please let me know.
>
>     === Solution #1 ===
>
> Another solution is to require that complex feature groups (groups
> with more than one feature) must also have a link to the root element
> of the feature group.  I bought this up before but agreed with others
> that that there wasn't a need for it.  Now I think there is.
>
> Here's an example.
>
>    <FEATURE uri="blah" root="uri/for/root/feature" />
>      <PARENT ... />
>      <PART ... />
>    </FEATURE>
>
> By using a 'root' attribute, detecting the end of a feature group is
> almost trivial:
>
>    a FeatureGroup contains:
>       - list of seen urls    # duplicates are not allowed
>       - set of urls which must be seen  # duplicates are ignored
>
>    let feature_groups :=  mapping {root uri -> FeatureGroup }
>
>    for feature in features:
>      if feature does not have a @root attribute:
>         make a new FeatureGroup
>         add the feature to the FeatureGroup as being seen
>         let feature_groups[feature's @uri attribute] := the new
> FeatureGroup
>
>      else:
>         if the features's @root attribute does not exist in
> features_group:
>            # first time this feature group was seen
>            create a new FeatureGroup
>            let feature_groups[feature's @root attribute] := the new
> FeatureGroup
>
>         get feature_group[feature's @root attribute]
>         add this feature to the FeatureGroup as a seen url
>         for each uri in (feature's @uri attribute, the parent uris, the
> part uris):
>              add the uri to the FeatureGroup's "must be seen" set
>         if count(seen urls) == count(must be seen urls):
>              the feature group is complete / assemble the links
>
> Assembly of a feature group occurs as soon as all the features are
> available,
> rather than waiting for the end.
>
> This makes life much simpler for the writeback, and I assume also for
> the client code.  Assuming the client code doesn't just wait until the
> end of the input before it does anything.
>
> Gregg?  Do you wait until the end of the XML to assemble hierarchical
> features?
> If so, do you need parent/part or will parent suffice?  Or do you do
> all the
> bookkeeping as you get the data?  How complex is the code?
>
> There are other solutions:
>
>    === Solution #2 ===
>   - require that the features are ordered so that parents comes before a
> part
>
> I think this is hard because it relational databases aren't naturally
> ordered.
> The normal trick is to put an extra field and "sort by", but then the
> server
> has to maintain the correct ordering.  It's fragile.
>
>    === Solution #3 ===
>
>    - put all <FEATURE> elements for a given feature group explicitly
> inside of a <FEATURE_GROUP> element.
>
> Eg,
>
> <FEATURES>
>    <FEATURE_GROUP>
>      <FEATURE .. schema is unchanged />
>      <FEATURE .. schema is unchanged />
>      <FEATURE .. schema is unchanged />
>    </FEATURE_GROUP>
>    <FEATURE .. this is a simple feature; structure unchanged from usual
> />
>
> (in this case simple features not part of a complex parent/part
> relationship
> need not be in a FEATURE_GROUP.)
>
> This is the easiest solution.  I like it because it's the easiest to
> figure
> out.  Even the algorithm above is hard by comparison.
>
> If I had my choice we would do this instead of determining the feature
> group
> by analysis of the parent/part linkages.
>
> Note that with this change there's no longer need for the PART element.
> We would only need PARENT.
>
>                                         Andrew
>                                         dalke at dalkescientific.com
>
> _______________________________________________
> DAS2 mailing list
> DAS2 at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/das2
>