[Bioperl-l] Structured (nested) Annotation

Chris Mungall cjm@fruitfly.org
Mon, 7 Oct 2002 13:07:02 -0700 (PDT)


Hmm, I have a few problems with this thread - I'm always a little uneasy
with the idea that everything has to fit into some perfect uber object
model, and anything that diverges from the Perfect Model is Obviously
Wrong.

Ok, I'll admit the practice of using the same sequence entity for multiple
proteins in multiple species seems a little off - if you want to annotate
the sequence with post-translational modifications, can you be sure the
annotations are true across all species?

But then swissprot may have good reasons for collapsing identical
sequences into the same entity - ease of database management for one
thing.

Besides, ensembl does a similar thing with alternate spliceforms producing
proteins with identical sequences - these are collapsed into the same
entity, even though they are distinct proteins, possibly with distinct
cellular localisation, post-translational modifications etc.

Most ensembl users don't care, so it's reasonable for ensembl to do this.
But the point is, there are always multiple ways of viewing the data. A
view with everything disentangled would be too big, so you're always going
to have to collapse some entities (eg protein and protein sequence). There's
no one correct way of doing it.

Another thing - has anyone considered, instead of the Bio::Annotation
object, just attaching a lightweight xml structure to the seq/feature? This
could be a simple nested array, and you could use standard ways of
querying/transforming it. I've used this pattern in the past and it's nice:
strict model where you need it, loose/extensible where you want it.
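
To make the nested-array idea concrete, here is a minimal sketch in Python
(not Bioperl code). The [tag, value] node layout, the tag names and the
helpers find_all and to_xml are all made up for illustration; only the gene
symbols are taken from the GN example in this thread.

```python
from xml.sax.saxutils import escape

# Hypothetical layout: every node is [tag, value], where value is either a
# string (leaf) or a list of child nodes. Tag names are illustrative only.
annotation = [
    "annotation", [
        ["gene", [
            ["symbol", "CALM1"],
            ["synonym", "CAM1"],
            ["synonym", "CALM"],
        ]],
    ],
]


def find_all(node, tag):
    """Yield the value of every node named `tag`, depth first."""
    name, value = node
    if name == tag:
        yield value
    if isinstance(value, list):
        for child in value:
            yield from find_all(child, tag)


def to_xml(node, depth=0):
    """Serialise a node to indented XML text."""
    name, value = node
    pad = "  " * depth
    if isinstance(value, list):
        inner = "\n".join(to_xml(child, depth + 1) for child in value)
        return f"{pad}<{name}>\n{inner}\n{pad}</{name}>"
    return f"{pad}<{name}>{escape(str(value))}</{name}>"


synonyms = list(find_all(annotation, "synonym"))
xml_text = to_xml(annotation)
```

The structure carries no schema of its own, so new tags can be added without
touching the object model, and once it is serialised the usual xml tooling
can take over for querying or transforming.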

Ok, I will admit this is a bit daft:
GN   (CALM1 OR CAM1 OR CALM OR CAM) AND (CALM2 OR CAM2 OR CAMB) AND
GN   (CALM3 OR CAM3 OR CAMC).

But I'm sure you can get away with turning it into a flat list - gene
symbols are generally a bit of a nightmare anyway, and that's not
swissprot's fault.
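
A rough sketch of that flattening, in Python rather than Perl. It assumes
the GN lines always look exactly like the example above (parenthesised
groups of names joined by OR, groups joined by AND, a trailing full stop);
parse_gn and flatten_gn are hypothetical helper names, not part of any
existing parser.

```python
import re

# The GN lines quoted above, verbatim.
GN_LINES = [
    "GN   (CALM1 OR CAM1 OR CALM OR CAM) AND (CALM2 OR CAM2 OR CAMB) AND",
    "GN   (CALM3 OR CAM3 OR CAMC).",
]


def parse_gn(lines):
    """Return one list of synonyms per gene (per AND-separated group)."""
    # Drop the two-letter line code, join continuation lines, and remove
    # the trailing full stop.
    text = " ".join(line[2:].strip() for line in lines if line.startswith("GN"))
    text = text.rstrip(".")
    groups = []
    for part in re.split(r"\s+AND\s+", text):
        part = part.strip().strip("()")
        names = [n.strip() for n in re.split(r"\s+OR\s+", part) if n.strip()]
        if names:
            groups.append(names)
    return groups


def flatten_gn(groups):
    """Collapse the per-gene groups into a single flat list of symbols."""
    return [name for group in groups for name in group]


groups = parse_gn(GN_LINES)
flat = flatten_gn(groups)
```

Keeping parse_gn separate from flatten_gn means the grouping is still
available to anyone who cares which synonyms belong to which gene, while
everyone else just takes the flat list.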