[Bioperl-l] Structured (nested) Annotation

Hilmar Lapp hlapp@gnf.org
Mon, 7 Oct 2002 10:44:24 -0700


> -----Original Message-----
> From: Ewan Birney [mailto:birney@ebi.ac.uk]
> Sent: Monday, October 07, 2002 1:02 AM
> To: Hilmar Lapp
> Cc: Bioperl
> Subject: Re: [Bioperl-l] Structured (nested) Annotation
> 
> 
> On Sun, 6 Oct 2002, Hilmar Lapp wrote:
> 
> > 
> > GN   (CALM1 OR CAM1 OR CALM OR CAM) AND (CALM2 OR CAM2 OR CAMB) AND
> > GN   (CALM3 OR CAM3 OR CAMC).
> > 
> > The parser screwed this up before because it didn't know about the 
> > AND join operator nor possible nesting (nor multiple GN lines). I 
> > fixed all this and changed the parser to now construct an 
> > Annotation::StructuredValue object for this.
> > 
> > NOTE: The consequence is that now you get back only _ONE_ object for 
> > $seq->annotation->get_Annotations("gene_name"). You need to call 
> > get_all_values() on that object to see all the names. Calling 
> > value() will return the structured array flattened into a single 
> > string.
> 
> I do find this a sort of "bending the object model to fit questionable 
> swissprot practices" - I am going to hassle wolfgang (except he has just 
> had a baby, so maybe not. Hmmm. Nicole) to see whether swissprot will ever
> consider disentangling multiple species entries. 

If you can talk swissprot into disentangling this, we'd nominate you for the 'anti-insanity in bioinformatics war hero' award ... :-)

BTW the structure in the GN lines is not only due to multiple species, but may also result from multiple genes in the same species (if they give rise to the same protein sequence).

> 
> 
> Was the thought to come up with a specific "gene name" annotation object 
> type, slotted in the right way rejected by you?

Not entirely, but it doesn't change the fact that this annotation is structured and nested, so I thought, why not abstract this away from gene names and write a generically usable module? If you do have nested values, that module lets you do nice things, like transform from one way of flattening to another. E.g., given the GN line above, the call $ann->value(-joins => ['|',','], -brackets => ['{','}']) will return "{CALM1,CAM1,CALM,CAM}|{CALM2,CAM2,CAMB}|{CALM3,CAM3,CAMC}", which may or may not be useful ...
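
To make the re-flattening idea concrete, here is a minimal, language-neutral sketch in Python. It is not the BioPerl implementation: the function name flatten, its parameters, and the plain nested-list representation are assumptions made for illustration only. It takes one join string per nesting level and brackets only the nested groups, which reproduces both the original GN notation and the alternative form quoted above.

```python
def flatten(value, joins=(" AND ", " OR "), brackets=("(", ")"), depth=0):
    """Render a nested list of strings as one string (hypothetical helper)."""
    if isinstance(value, str):
        return value
    # Pick the join string for this nesting level; if the structure is
    # nested deeper than there are join strings, reuse the last one.
    join = joins[min(depth, len(joins) - 1)]
    body = join.join(flatten(v, joins, brackets, depth + 1) for v in value)
    # Only nested groups are bracketed; the top level is left bare.
    return body if depth == 0 else brackets[0] + body + brackets[1]


# The GN line from this thread as a nested list: the outer level holds the
# AND-joined groups, the inner level the OR-joined names.
gene_names = [
    ["CALM1", "CAM1", "CALM", "CAM"],
    ["CALM2", "CAM2", "CAMB"],
    ["CALM3", "CAM3", "CAMC"],
]

print(flatten(gene_names))
print(flatten(gene_names, joins=("|", ","), brackets=("{", "}")))
```

With the default arguments the sketch reproduces the content of the GN lines (without the GN prefix and trailing period); with the second set of arguments it yields the '|' and '{...}' form.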

> I just think that the "nesting" semantics is kinda implicitly in the way the 
> parsers interact with the files and objects

I'm not sure I understand what you mean.

> - a more explicit system would be to have
> 
> 
>   Bio::Annotation::GeneName
>                      (has-a list of Bio::GeneNameValues)
> 
> Or something similar.
> 

First, a mere list does not suffice, unless the list consists of lists (which may be intimidating). Second, it can be written reasonably easily now, either heavyweight, with Annotation::GeneName being a nested collection of Annotation::SimpleValue objects, or by inheriting from Annotation::StructuredValue.

The problem I see is that the semantics is not explicit, and Annotation::GeneName with simple lists wouldn't make it explicit either. By semantics I mean that the nesting carries meaning as to whether a name is a synonym or a homologue/paralogue. One could of course add methods that make this explicit; OTOH I'd rather hope that other databases do not follow swissprot's example of normalizing by protein sequence, in which case such an object may be confined solely to swissprot.
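
As a rough illustration of what "methods that make it explicit" could look like, here is a hypothetical sketch in Python. The class name GeneNames and its methods are invented for this example and are not part of BioPerl. It assumes that the outer (AND) level separates distinct genes and that the inner (OR) level lists names for the same gene, which is one reading of the GN line in this thread.

```python
class GeneNames:
    """Hypothetical wrapper that names the meaning of each nesting level."""

    def __init__(self, groups):
        # Assumption: each inner list holds the names of one gene.
        self.groups = [list(g) for g in groups]

    def genes(self):
        """Return the first-listed name of each distinct gene."""
        return [g[0] for g in self.groups if g]

    def synonyms(self, name):
        """Return the other names listed in the same group as `name`."""
        for g in self.groups:
            if name in g:
                return [n for n in g if n != name]
        return []

    def same_gene(self, a, b):
        """True if both names appear in the same inner group."""
        return any(a in g and b in g for g in self.groups)


gn = GeneNames([
    ["CALM1", "CAM1", "CALM", "CAM"],
    ["CALM2", "CAM2", "CAMB"],
    ["CALM3", "CAM3", "CAMC"],
])

print(gn.genes())                       # one name per distinct gene
print(gn.synonyms("CAM2"))              # other names for the same gene
print(gn.same_gene("CALM1", "CALM2"))   # names from different groups
```

The point of the sketch is only that the caller no longer has to know which nesting level means what; whether such an object is worth having outside swissprot is a separate question.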

Bottom line is, I needed something there that at least leaves me a chance to extract the knowledge of the annotator (in fact, since biosql can't handle this kind of thing, right now I flatten out the nesting anyway before dumping it into biosql). I'm happy to change or enhance it if there's consensus, or to adopt a better solution. (But the present one does work for me, so the ball is now in the court of those who don't like it :-)
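
For completeness, flattening out the nesting before storage can be sketched as follows. This is again a standalone Python illustration with a hypothetical function name (all_values), not the code that actually feeds biosql: it simply walks the nested structure and collects every name into one flat list, discarding the grouping.

```python
def all_values(value):
    """Collect every string in a nested list into one flat list."""
    if isinstance(value, str):
        return [value]
    flat = []
    for item in value:
        # Recurse so that arbitrarily deep nesting is handled.
        flat.extend(all_values(item))
    return flat


nested = [
    ["CALM1", "CAM1", "CALM", "CAM"],
    ["CALM2", "CAM2", "CAMB"],
    ["CALM3", "CAM3", "CAMC"],
]

print(all_values(nested))
```

The grouping information is lost at this step, which is exactly the trade-off described above.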

	-hilmar