[Bioperl-l] Structured (nested) Annotation

Ewan Birney birney@ebi.ac.uk
Mon, 7 Oct 2002 09:02:15 +0100 (BST)


On Sun, 6 Oct 2002, Hilmar Lapp wrote:

> I wanted to be able to represent structured, nested annotation (we 
> were there a while ago weren't we?). I made two changes/additions 
> that can accomplish this, but in different ways.
> 
> 1) Annotation collections can now be nested, because 
> Annotation::Collection now implements AnnotationI.
> 
> I added two methods dealing specifically with nested annotation: 
> get_all_Annotations() is similar to get_Annotations(), but traverses 
> the whole tree of nested collections (if there is no nesting, it 
> behaves identically to get_Annotations()). flatten_Annotations() makes 
> a nested collection un-nested.
> 
> I thought it may be a good idea to promote get_all_Annotations() to 
> the interface (AnnotationCollectionI). What do people think? 
> Ewan/Jason?
> 
> 2) Nesting through all-objects is somewhat heavy-weight if all you 
> want to nest is simple values. So I added 
> Annotation::StructuredValue which inherits from 
> Annotation::SimpleValue and can be called as if it were a simple 
> value. In addition, there are methods to add simple values in a 
> structured, nested way, and control the way the structure is 
> flattened into a single string. Also, there is get_all_values() 
> returning a flattened array of values.
> 
> My starting use case was to somehow retain the structured 
> information in swissprot GN lines.
> 
> In case you're unfamiliar with swissprot, GN gives the names of the 
> genes that give rise to the protein sequence of the entry. Different 
> genes are concatenated by ' AND ', whereas synonyms of the same gene 
> are concatenated by ' OR '. Both may co-occur in the same GN line, 
> in which case parentheses are used to group. An infamous example is 
> Calmodulin (many many species for this entry ...):
> 
> GN   (CALM1 OR CAM1 OR CALM OR CAM) AND (CALM2 OR CAM2 OR CAMB) AND
> GN   (CALM3 OR CAM3 OR CAMC).
> 
> The parser screwed this up before because it didn't know about the 
> AND join operator nor possible nesting (nor multiple GN lines). I 
> fixed all this and changed the parser to now construct an 
> Annotation::StructuredValue object for this.
> 
> NOTE: The consequence is that now you get back only _ONE_ object for 
> $seq->annotation->get_Annotations("gene_name"). You need to call 
> get_all_values() on that object to see all the names. Calling 
> value() will return the structured array flattened into a single 
> string.
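
For reference, the GN convention described above (genes joined by ' AND ', synonyms joined by ' OR ', parentheses for grouping, possibly spread over several GN lines) can be sketched as below. This is a minimal illustration in Python, not the Perl code of the actual parser; parse_gn(), all_values() and flatten() are hypothetical names chosen for this sketch and are not part of the BioPerl API. It assumes at most one level of parentheses and that gene names never contain ' AND ' or ' OR ' themselves.

```python
def parse_gn(lines):
    """Parse swissprot GN lines into a list of genes, each a list of synonyms."""
    # Drop the two-letter "GN" line code and join continuation lines.
    text = " ".join(line[2:].strip() for line in lines if line.startswith("GN"))
    text = text.rstrip(".")  # the GN block ends with a period
    genes = []
    for group in text.split(" AND "):       # different genes
        group = group.strip()
        if group.startswith("(") and group.endswith(")"):
            group = group[1:-1]              # remove the grouping parentheses
        genes.append([name.strip() for name in group.split(" OR ")])  # synonyms
    return genes


def all_values(genes):
    """Flatten the nested structure into a plain list of names."""
    return [name for gene in genes for name in gene]


def flatten(genes):
    """Flatten the nested structure back into a single string."""
    return " AND ".join("(" + " OR ".join(gene) + ")" for gene in genes)


gn_lines = [
    "GN   (CALM1 OR CAM1 OR CALM OR CAM) AND (CALM2 OR CAM2 OR CAMB) AND",
    "GN   (CALM3 OR CAM3 OR CAMC).",
]
genes = parse_gn(gn_lines)
```

Here genes holds three lists, one per gene, which is the structure that has to survive somewhere in the object model; all_values(genes) and flatten(genes) correspond loosely to the two views (array of values, single string) mentioned in the NOTE above.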

I do find this a sort of "bending the object model to fit questionable 
swissprot practices" - I am going to hassle Wolfgang (except he has just 
had a baby, so maybe not. Hmmm. Nicole) to see whether swissprot will ever
consider disentangling multiple species entries. 


Did you consider, and reject, a specific "gene name" annotation object 
type, slotted in the right way? I just think that the "nesting" semantics 
are kind of implicit in the way the parsers interact with the files and 
objects - a more explicit system would be to have


  Bio::Annotation::GeneName
                     (has-a list of Bio::GeneNameValues)

Or something similar.
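
To make the has-a idea a bit more concrete, here is a rough sketch of what such an explicit model might look like. It is written in Python purely for illustration; the class names mirror the ones proposed above, but every attribute and method (name, synonyms, add_gene, all_names, as_text) is a hypothetical choice for this sketch, and treating the first name of each group as the primary name is an assumption, not something swissprot guarantees.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GeneNameValue:
    """One gene: a name plus its synonyms (hypothetical layout)."""
    name: str
    synonyms: List[str] = field(default_factory=list)

    def all_names(self) -> List[str]:
        return [self.name] + self.synonyms


@dataclass
class GeneName:
    """The annotation object: has-a list of GeneNameValue objects."""
    genes: List[GeneNameValue] = field(default_factory=list)

    def add_gene(self, name: str, synonyms=()) -> None:
        self.genes.append(GeneNameValue(name, list(synonyms)))

    def all_names(self) -> List[str]:
        # Every name of every gene, in order of appearance.
        return [n for gene in self.genes for n in gene.all_names()]

    def as_text(self) -> str:
        # Rebuild a GN-style string: OR within a gene, AND between genes.
        return " AND ".join(
            "(" + " OR ".join(gene.all_names()) + ")" for gene in self.genes
        )


annotation = GeneName()
annotation.add_gene("CALM1", ["CAM1", "CALM", "CAM"])
annotation.add_gene("CALM2", ["CAM2", "CAMB"])
```

The point of the sketch is only that "gene" and "synonym" become named slots rather than positions in a nested array, so the meaning no longer lives in the parser.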


Not a good solution?