[DAS2] complex features

Thu Mar 23 21:44:00 UTC 2006

chris:
> The GFF3 spec says that Parent can only be used to indicate part_of 
> relations. If we go by the definition of part_of in the OBO relations 
> ontology, or any other definition of part_of (there are many), then 
> cycles are explicitly verboten, although the GFF3 docs do not state 
> this.

It looks like the most recent spec at
   http://song.sourceforge.net/gff3.shtml
does state this, although the earlier one did not:

   "A Parent relationship between two features that is not one of the
    Part-Of relationships listed in SO should trigger a parse exception
    Similarly, a set of Parent relationships that would cause a cycle
    should also trigger an exception."
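
The cycle check the spec asks for can be sketched as a depth-first walk along the Parent links. This is only an illustration in Python: the function name and the dict-of-lists input are assumptions made for the example, not part of any GFF3 library.

```python
def find_parent_cycle(parents):
    """Return a list of IDs that form a cycle, or None if there is none.

    `parents` is a hypothetical mapping from each feature ID to the list
    of IDs named in that feature's Parent attribute.
    """
    WHITE, GRAY, BLACK = 0, 1, 2      # unvisited / on current path / done
    color = {}
    for start in parents:
        if color.get(start, WHITE) != WHITE:
            continue
        color[start] = GRAY
        path = [start]
        stack = [iter(parents.get(start, ()))]
        while stack:
            for parent in stack[-1]:
                state = color.get(parent, WHITE)
                if state == GRAY:
                    # The parent is already on the current path: a cycle.
                    return path[path.index(parent):] + [parent]
                if state == WHITE:
                    color[parent] = GRAY
                    path.append(parent)
                    stack.append(iter(parents.get(parent, ())))
                    break
            else:
                # Every parent of the current feature has been explored.
                color[path.pop()] = BLACK
                stack.pop()
    return None
```

A diamond (one exon with two mRNA parents that share a gene) is not a cycle and returns None; only a chain of Parent links that leads back to its own start is reported.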

> There's no reason in general why part_of graphs should have a single
> root, although it's certainly desirable from a software perspective.
> Dicistronic genes throw a bit of a spanner in the works. There's nothing
> to stop you adding a fake root, or referring to the maximally connected
> graph as an entity in its own right however.

I've been working with GFF3 data for a few days now, trying to
catch the different cases.  It isn't hard, but it had been a long
time since I worried about cycle detection.

The biggest problem has been keeping all the "could be a parent"
elements around until the entire data set is finished.  Except
for features with no ID and no Parent, parsers need to read to
the end of the file (or to the next "###" no-forward-references
line) before being able to do anything with the data.
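
A minimal sketch of that buffering, in Python. The function names and the (columns, attributes) record shape are made up for the example; URL-escapes and the ##FASTA section are not handled.

```python
def parse_attributes(column9):
    """Split a GFF3 attribute column into {key: [values]} (no unescaping)."""
    attrs = {}
    for field in column9.split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip().split(",")
    return attrs


def iter_gff3_batches(lines):
    """Yield lists of (first_8_columns, attributes) records.

    A feature with neither ID nor Parent is yielded at once as a batch of
    one. Everything else is held back until a "###" line or end of input.
    """
    pending = []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.strip() == "###":
            # All forward references made so far are now resolved.
            if pending:
                yield pending
                pending = []
            continue
        if not line or line.startswith("#"):
            continue                  # blank line, comment or other directive
        columns = line.split("\t")
        if len(columns) != 9:
            raise ValueError("expected 9 tab-separated columns: %r" % line)
        record = (columns[:8], parse_attributes(columns[8]))
        if "ID" not in record[1] and "Parent" not in record[1]:
            yield [record]
        else:
            pending.append(record)
    if pending:
        yield pending
```

If the file has no "###" lines, the last batch holds nearly the whole file, which is the memory problem described above.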

In DAS it's easier because each feature lists all parents and
children, so it's possible to detect when a complex feature is
ready.  Even then it requires a bit of thinking to handle cases
with multiple roots.  It would be much easier if either each
complex feature were wrapped in its own element

   <COMPLEX-FEATURE>
    <FEATURE id="1" />
    <FEATURE id="2" />
   </COMPLEX-FEATURE>

or if there were a unique name to tie them together

    <FEATURE id="1" complex-feature-id="A"/>
    <FEATURE id="2" complex-feature-id="A"/>
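
Without either of those, the "is this complex feature ready?" bookkeeping looks roughly like the Python sketch below. The (id, parent ids, child ids) tuples are an assumed stand-in for parsed DAS FEATURE elements, and the function name is hypothetical. Groups are merged as features arrive, and a group is released once every ID it mentions has been seen, which also covers the multiple-root case.

```python
def iter_complete_groups(features):
    """Yield lists of feature IDs, one list per finished complex feature.

    `features` is an iterable of (feature_id, parent_ids, child_ids).
    """
    seen = set()
    group_of = {}                     # any mentioned ID -> its group
    for fid, parent_ids, child_ids in features:
        seen.add(fid)
        linked = list(parent_ids) + list(child_ids)
        merged = {"members": [], "missing": set()}
        # Fold in every existing group this feature touches.
        for key in [fid] + linked:
            old = group_of.get(key)
            if old is not None and old is not merged:
                merged["members"].extend(old["members"])
                merged["missing"].update(old["missing"])
                for k in old["members"]:
                    group_of[k] = merged
                for k in old["missing"]:
                    group_of[k] = merged
        merged["members"].append(fid)
        merged["missing"].discard(fid)
        merged["missing"].update(k for k in linked if k not in seen)
        group_of[fid] = merged
        for key in linked:
            group_of[key] = merged
        if not merged["missing"]:
            # Every ID mentioned by this group has now been seen.
            for key in merged["members"]:
                del group_of[key]
            yield merged["members"]
```

With a wrapping element or a shared complex-feature-id, all of this would collapse to grouping by one key.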

Another solution is to make the problem simpler.  I see, for
example, that Biopython doesn't have any GFF code and the BioJava
code only works at the single-feature level.  Only BioPerl
implements a GFF3 parser with support for complex features,
but it assumes all complex features are single rooted and that
the features are topologically sorted, so that parents come
before children.  It also looks like a diamond structure (single
root, two children, both sharing the same child) is supported on
input, but the output assumes features are trees.

For example, I tried it just now on dmel-4-r4.3.gff from FlyBase,
which I'm finding to be a bad example of what a GFF file should
look like.  It contains one duplicate ID, which BioPerl catches
and dies on.  After I fixed that, it complained with a lot of

    MSG: Bio::SeqFeature::Annotated=HASH(0xba4a93c) is not contained
    within parent feature, and expansion is not valid, ignoring.

because the features are not topologically sorted, as in this
(trimmed) example.  The order is the same as in the file.

4  sim4:na_dbEST.same.dmel match_part  5175  5627 ...
                        Parent=88682278868229;Name=GH01459.5prime
4  sim4:na_dbEST.same.dmel match   5175    5627 ...
                       ID=88682278868229;Name=GH
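
Putting records like these into parents-first order is a standard topological sort. Here is a sketch in Python using Kahn's algorithm. The (id, parent ids) record shape and the function name are assumptions for the example, IDs are assumed to be unique, and Parent IDs that never appear in the input are silently ignored.

```python
from collections import deque


def parents_first(records):
    """Reorder (feature_id_or_None, parent_ids) records, parents first."""
    index_of = {fid: i for i, (fid, _) in enumerate(records) if fid is not None}
    children = [[] for _ in records]
    waiting = [0] * len(records)      # number of parents not yet emitted
    for i, (_, parent_ids) in enumerate(records):
        for parent in parent_ids:
            j = index_of.get(parent)
            if j is not None:
                children[j].append(i)
                waiting[i] += 1
    # Start with every record that has no parent in the input.
    ready = deque(i for i, n in enumerate(waiting) if n == 0)
    ordered = []
    while ready:
        i = ready.popleft()
        ordered.append(records[i])
        for child in children[i]:
            waiting[child] -= 1
            if waiting[child] == 0:
                ready.append(child)
    if len(ordered) != len(records):
        raise ValueError("Parent relationships contain a cycle")
    return ordered
```

For the two lines above, the match record (the one with the ID) comes out first and the match_part that names it as Parent comes out second.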

The simpler the data model we use (e.g., single rooted, output
must be topologically sorted with parents first), the more
likely it is that client and server code will be correct, and
the more DAS code is likely to get written.

					Andrew
					dalke at dalkescientific.com