[Bioperl-l] Structured (nested) Annotation
Ewan Birney
birney@ebi.ac.uk
Mon, 7 Oct 2002 09:02:15 +0100 (BST)
On Sun, 6 Oct 2002, Hilmar Lapp wrote:
> I wanted to be able to represent structured, nested annotation (we
> were there a while ago weren't we?). I made two changes/additions
> that can accomplish this, but in different ways.
>
> 1) Annotation collections can now be nested, because
> Annotation::Collection now implements AnnotationI.
>
> I added two methods dealing specifically with nested annotation:
> get_all_Annotations() is similar to get_Annotations(), but traverses
> the whole tree of nested collections (if there is no nesting, it
> behaves identically to get_Annotations()). flatten_Annotations() makes
> a nested collection un-nested.
>
> I thought it may be a good idea to promote get_all_Annotations() to
> the interface (AnnotationCollectionI). What do people think?
> Ewan/Jason?
>
> 2) Nesting through all-objects is somewhat heavy-weight if all you
> want to nest is simple values. So I added
> Annotation::StructuredValue which inherits from
> Annotation::SimpleValue and can be called as if it were a simple
> value. In addition, there are methods to add simple values in a
> structured, nested way, and control the way the structure is
> flattened into a single string. Also, there is get_all_values()
> returning a flattened array of values.
>
> My starting use case was to somehow retain the structured
> information in swissprot GN lines.
>
> In case you're unfamiliar with swissprot, GN gives the names of the
> genes that give rise to the protein sequence of the entry. Different
> genes are concatenated by ' AND ', whereas synonyms of the same gene
> are concatenated by ' OR '. Both may co-occur in the same GN line,
> in which case parentheses are used to group. An infamous example is
> Calmodulin (many many species for this entry ...):
>
> GN (CALM1 OR CAM1 OR CALM OR CAM) AND (CALM2 OR CAM2 OR CAMB) AND
> GN (CALM3 OR CAM3 OR CAMC).
>
> The parser screwed this up before because it didn't know about the
> AND join operator nor possible nesting (nor multiple GN lines). I
> fixed all this and changed the parser to now construct an
> Annotation::StructuredValue object for this.
>
> NOTE: The consequence is that now you get back only _ONE_ object for
> $seq->annotation->get_Annotations("gene_name"). You need to call
> get_all_values() on that object to see all the names. Calling
> value() will return the structured array flattened into a single
> string.
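
For readers who have not met the GN format, here is a minimal sketch of the
nesting described in the quoted text. It is written in Python, not Perl, and
it is not the BioPerl parser or the Annotation::StructuredValue API: the
function names parse_gn, all_values and flatten are hypothetical, and the
sketch assumes only the ' AND ' / ' OR ' / parentheses syntax described
above, with at most one level of grouping.

```python
import re


def parse_gn(lines):
    """Parse GN lines into a list of genes, each a list of synonyms.

    Assumes the ' AND ' / ' OR ' syntax quoted above, with at most one
    level of parentheses. Hypothetical helper, not a BioPerl method.
    """
    # Drop the 'GN' line tag and join continuation lines into one string.
    text = " ".join(re.sub(r"^GN\s+", "", ln.strip()) for ln in lines)
    # The last GN line ends with a period.
    text = text.rstrip(".").strip()
    genes = []
    for group in re.split(r"\s+AND\s+", text):
        group = group.strip()
        # Parentheses only group synonyms; strip them if present.
        if group.startswith("(") and group.endswith(")"):
            group = group[1:-1]
        genes.append([name.strip() for name in re.split(r"\s+OR\s+", group)])
    return genes


def all_values(genes):
    """Return every name in order, ignoring the nesting."""
    return [name for gene in genes for name in gene]


def flatten(genes):
    """Join the nested structure back into a single GN-style string."""
    several_genes = len(genes) > 1
    parts = []
    for gene in genes:
        joined = " OR ".join(gene)
        # Parentheses are only needed when AND and OR co-occur.
        needs_parens = several_genes and len(gene) > 1
        parts.append("(" + joined + ")" if needs_parens else joined)
    return " AND ".join(parts)


if __name__ == "__main__":
    calmodulin = [
        "GN (CALM1 OR CAM1 OR CALM OR CAM) AND (CALM2 OR CAM2 OR CAMB) AND",
        "GN (CALM3 OR CAM3 OR CAMC).",
    ]
    nested = parse_gn(calmodulin)
    print(nested)
    print(all_values(nested))
    print(flatten(nested))
```

Here all_values plays roughly the role that get_all_values() has in the
quoted description, and flatten the role of value(); how the real class
joins the values is configurable and is not reproduced here.
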
I do find this a sort of "bending the object model to fit questionable
swissprot practices" - I am going to hassle Wolfgang (except he has just
had a baby, so maybe not. Hmmm. Nicole) to see whether swissprot will ever
consider disentangling multiple species entries.
Did you consider, and reject, a specific "gene name" annotation object
type, slotted in the right way? I just think that the "nesting" semantics
are kind of implicit in the way the parsers interact with the files and
objects - a more explicit system would be to have
Bio::Annotation::GeneName
(has-a list of Bio::GeneNameValues)
Or something similar.
Not a good solution?
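
To make the has-a idea concrete, a rough sketch follows. It is in Python
rather than Perl and is not an existing BioPerl class: GeneNameValue,
GeneNameAnnotation and all of their methods are hypothetical, and treating
the first name of each gene as the primary name and the rest as synonyms
is an assumption, not something swissprot or the proposal above specifies.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GeneNameValue:
    """One gene: a primary name plus synonyms (assumed split)."""
    name: str
    synonyms: List[str] = field(default_factory=list)

    def all_names(self) -> List[str]:
        return [self.name] + self.synonyms


@dataclass
class GeneNameAnnotation:
    """Explicit gene-name annotation holding a list of GeneNameValue."""
    genes: List[GeneNameValue] = field(default_factory=list)

    def add_gene(self, name: str, *synonyms: str) -> None:
        self.genes.append(GeneNameValue(name, list(synonyms)))

    def all_names(self) -> List[str]:
        # Every name of every gene, in insertion order.
        return [n for gene in self.genes for n in gene.all_names()]

    def as_text(self) -> str:
        # Always parenthesise each gene so the grouping stays visible.
        return " AND ".join(
            "(" + " OR ".join(gene.all_names()) + ")" for gene in self.genes
        )


if __name__ == "__main__":
    ann = GeneNameAnnotation()
    ann.add_gene("CALM1", "CAM1", "CALM", "CAM")
    ann.add_gene("CALM2", "CAM2", "CAMB")
    print(ann.all_names())
    print(ann.as_text())
```

The point of the explicit type is that "gene" and "synonym" become named
slots instead of positions in a nested array.
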