[DAS2] feature group assembly; proposals for simplification
Allen Day
allenday at ucla.edu
Tue Sep 19 17:06:24 UTC 2006
I wrote a writeback parser for the current XML style, and although I did not
add code to reject multi-rooted groups (which may not be appropriate
anyway), didn't find the book-keeping to be particularly onerous.
If I understand clearly, the complaint isn't the book-keeping itself, but
rather the memory requirements imposed by the book-keeping. Why not just
give HTTP 413 (request entity too large) if you don't like the size of the
file being uploaded? Gregg and I had a discussion about likely writeback
document sizes during the last code sprint, and Genoviz is likely to be
giving documents in the 1-50KB range -- nowhere near 80MB of GFF3 worth of
features.
-Allen
On 9/11/06, Andrew Dalke <dalke at dalkescientific.com> wrote:
>
> I've been working on a writeback server. It will verify that the
> feature groups have no cycles, that if X is a parent to Y then Y
> is a part of X, and that all groups has a single root.
>
> I'm having a hard time with that, and harder than I expected.
>
> GFF3 had only one direction of relationship. As such it's impossible
> to assemble a feature group until the end of the file or the
> marker that no lookahead is needed.
>
> We changed that in DAS2. The feature xml is bidirectional so
> in theory it's possible to know when the feature group is complete.
> But it's tricky. It's tricky enough that I want to change things
> slightly so that people don't need to handle the trickiness.
>
> The trickiness comes when parsing the list of features into
> feature groups. For example, consider
>
> [F3]
> / \
> [F4] [F2]
> | |
> [F1] [F5]
>
> where the features are in the order F1, F2, ... F5. After F2
> the system looks like there are two feature groups.
>
> [F3?]
> |
> [F4?] [F2]
> | |
> [F1] [F5?]
>
> Only after F3 can those be merged together. This requires some
> non-trivial bookkeeping, unless I've forgotten something simple
> from undergraduate data structures.
>
> Of course it's simple if you know that a feature is the last feature
> in a feature group either through reaching EOF or a special marker.
> But then what's the point of having bidirectional links if the
> result is no better than GFF3's only-list-parent solution.
>
> If there is a simple algorithm, please let me know.
>
> === Solution #1 ===
>
> Another solution is to require that complex feature groups (groups
> with more than one feature) must also have a link to the root element
> of the feature group. I bought this up before but agreed with others
> that that there wasn't a need for it. Now I think there is.
>
> Here's an example.
>
> <FEATURE uri="blah" root="uri/for/root/feature" />
> <PARENT ... />
> <PART ... />
> </FEATURE>
>
> By using a 'root' attribute, detecting the end of a feature group is
> almost trivial:
>
> a FeatureGroup contains:
> - list of seen urls # duplicates are not allowed
> - set of urls which must be seen # duplicates are ignored
>
> let feature_groups := mapping {root uri -> FeatureGroup }
>
> for feature in features:
> if feature does not have a @root attribute:
> make a new FeatureGroup
> add the feature to the FeatureGroup as being seen
> let feature_groups[feature's @uri attribute] := the new
> FeatureGroup
>
> else:
> if the features's @root attribute does not exist in
> features_group:
> # first time this feature group was seen
> create a new FeatureGroup
> let feature_groups[feature's @root attribute] := the new
> FeatureGroup
>
> get feature_group[feature's @root attribute]
> add this feature to the FeatureGroup as a seen url
> for each uri in (feature's @uri attribute, the parent uris, the
> part uris):
> add the uri to the FeatureGroup's "must be seen" set
> if count(seen urls) == count(must be seen urls):
> the feature group is complete / assemble the links
>
> Assembly of a feature group occurs as soon as all the features are
> available,
> rather than waiting for the end.
>
> This makes life much simpler for the writeback, and I assume also for
> the client code. Assuming the client code doesn't just wait until the
> end of the input before it does anything.
>
> Gregg? Do you wait until the end of the XML to assemble hierarchical
> features?
> If so, do you need parent/part or will parent suffice? Or do you do
> all the
> bookkeeping as you get the data? How complex is the code?
>
> There are other solutions:
>
> === Solution #2 ===
> - require that the features are ordered so that parents comes before a
> part
>
> I think this is hard because it relational databases aren't naturally
> ordered.
> The normal trick is to put an extra field and "sort by", but then the
> server
> has to maintain the correct ordering. It's fragile.
>
> === Solution #3 ===
>
> - put all <FEATURE> elements for a given feature group explicitly
> inside of a <FEATURE_GROUP> element.
>
> Eg,
>
> <FEATURES>
> <FEATURE_GROUP>
> <FEATURE .. schema is unchanged />
> <FEATURE .. schema is unchanged />
> <FEATURE .. schema is unchanged />
> </FEATURE_GROUP>
> <FEATURE .. this is a simple feature; structure unchanged from usual
> />
>
> (in this case simple features not part of a complex parent/part
> relationship
> need not be in a FEATURE_GROUP.)
>
> This is the easiest solution. I like it because it's the easiest to
> figure
> out. Even the algorithm above is hard by comparison.
>
> If I had my choice we would do this instead of determining the feature
> group
> by analysis of the parent/part linkages.
>
> Note that with this change there's no longer need for the PART element.
> We would only need PARENT.
>
> Andrew
> dalke at dalkescientific.com
>
> _______________________________________________
> DAS2 mailing list
> DAS2 at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/das2
>
More information about the DAS2
mailing list