[Bioperl-l] Annotation proposal

Ewan Birney birney@ebi.ac.uk
Fri, 6 Apr 2001 19:07:04 +0100 (BST)

Here is my proposal for annotation objects. I had a bit of an "I see
clearly now" moment thinking about this slippery customer when biking
for my train (also got a flat tire) so I bashed this out on the train.

It is based on the idea of run-time discovery of attributes held by
the annotation object. This can be implemented in any number of
ways. In some senses this is like each_tag_value and has_tag_value,
but with well-defined objects returned. It blends a number of ideas I
have had about this area.

The main interface is called Bio::AnnotationContainerI

It defines three methods

    # type here does not imply Object type but annotation type.
    # for example, type == GO and type == tissue could both give
    # back objects of type 'Bio::ControlledVocab::Word'

    # if there are no objects, must return empty list. Cannot throw exception
    # due to an "unknown type" entered
    @objects = $annotationcontainer->get_all_Annotation_by_type('reference');

    @types   = $annotationcontainer->get_all_Annotation_types();

    # guaranteed to give back the union of the results for all types
    @objects = $annotationcontainer->get_all_Annotation();
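
To make the three methods concrete, here is a minimal sketch of one way
a container could implement them, assuming a plain hash of arrays keyed
by annotation type. The class name My::SimpleAnnotationContainer and the
add_Annotation helper are made up for illustration and are not part of
the proposal.

```perl
use strict;
use warnings;

# Hypothetical class name; any implementation of the interface would do.
package My::SimpleAnnotationContainer;

sub new {
    my ($class) = @_;
    # annotation type (a string) => array ref of annotation objects
    return bless { by_type => {} }, $class;
}

# Hypothetical helper for filling the container; not one of the three methods.
sub add_Annotation {
    my ($self, $type, $annotation) = @_;
    push @{ $self->{by_type}{$type} }, $annotation;
    return;
}

# Unknown types give back an empty list rather than an exception.
sub get_all_Annotation_by_type {
    my ($self, $type) = @_;
    my $list = $self->{by_type}{$type};
    return $list ? @$list : ();
}

sub get_all_Annotation_types {
    my ($self) = @_;
    my @types = sort keys %{ $self->{by_type} };
    return @types;
}

# Union of the per-type lists, built from the two methods above.
sub get_all_Annotation {
    my ($self) = @_;
    my @all;
    push @all, $self->get_all_Annotation_by_type($_)
        for $self->get_all_Annotation_types;
    return @all;
}

package main;

my $container = My::SimpleAnnotationContainer->new;
# Plain strings stand in for real annotation objects here.
$container->add_Annotation('comment',   'first comment');
$container->add_Annotation('comment',   'second comment');
$container->add_Annotation('reference', 'a reference');

my @types   = $container->get_all_Annotation_types;
my @all     = $container->get_all_Annotation;
my @unknown = $container->get_all_Annotation_by_type('no-such-type');

print "types: @types\n";
printf "%d annotations in total, %d of an unknown type\n",
    scalar(@all), scalar(@unknown);
```

Asking for an unknown type does not register that type, so the list of
types stays the same after the failed lookup.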

The Object has to implement the interface Bio::Annotation::BaseI,
which defines the following methods

    # returns a Bio::Annotation::EvidenceI
    $evidence = $annotation->evidence();

    # returns appropriate object - is allowed to return
    # undef. This is a bit like GNOME:: resolve
    $object   = $annotation->resolve_to_Object('Bio::Seq');

    # these two methods both return short strings

    # returns a string, less than 80 chars, no line breaks
    $desc     = $annotation->description;

    # returns a string, less than 80 chars
    $type     = $annotation->type;
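
As an illustration of what a caller gets from this interface, here is a
small sketch that builds a display listing using only type and
description, without knowing the concrete class of each annotation.
My::StubAnnotation and annotation_lines are hypothetical names invented
for this example.

```perl
use strict;
use warnings;

# Hypothetical stand-in for any class implementing Bio::Annotation::BaseI.
package My::StubAnnotation;

sub new {
    my ($class, %args) = @_;
    return bless {%args}, $class;
}
sub evidence          { return $_[0]->{evidence} }
sub resolve_to_Object { return undef }    # the proposal allows undef here
sub description       { return $_[0]->{description} }
sub type              { return $_[0]->{type} }

package main;

# One display line per annotation, built only from type and description.
sub annotation_lines {
    my (@annotations) = @_;
    my @lines;
    for my $annotation (@annotations) {
        push @lines, sprintf('%-10s %s', $annotation->type, $annotation->description);
    }
    return @lines;
}

my @lines = annotation_lines(
    My::StubAnnotation->new(type => 'comment',   description => 'Comment'),
    My::StubAnnotation->new(type => 'reference', description => 'Reference: An example title'),
);
print "$_\n" for @lines;
```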

I will need help designing the Evidence interface. Here is a stab:


    $evidence->type;                 # string, one of "EVI_SIMILARITY,EVI_CURATOR,EVI_EXPERIMENT"
    @dblink = $evidence->referrals;  # array of dblinks. It can be assumed that

          SIMILARITY --- give database link
          EXPERIMENT --- give pubmed link or pseudo-pubmed link
          CURATOR    --- People object, which implements Bio::DBLinkI, 
                         with email == primary_id and name == optional_id
          UNKNOWN    --- Means unknown  

I realise that the People object is perhaps a little weird. (Help! Can
someone with evidence tracking experience take over here?)
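
A rough sketch of how the Evidence stab might look in code, under
several assumptions: the class names My::Evidence and My::Person are
invented, EVI_UNKNOWN is assumed as the string for the UNKNOWN case
(the type list above only names three values), and the constructor
arguments are made up for the example.

```perl
use strict;
use warnings;

package My::Evidence;   # hypothetical class name

# EVI_UNKNOWN is an assumption; the proposal lists UNKNOWN without a string.
my %KNOWN_TYPE = (
    EVI_SIMILARITY => 1,
    EVI_CURATOR    => 1,
    EVI_EXPERIMENT => 1,
    EVI_UNKNOWN    => 1,
);

sub new {
    my ($class, %args) = @_;
    my $type = defined $args{type} ? $args{type} : 'EVI_UNKNOWN';
    die "unknown evidence type '$type'\n" unless $KNOWN_TYPE{$type};
    return bless { type => $type, referrals => $args{referrals} || [] }, $class;
}

sub type      { return $_[0]->{type} }
sub referrals { return @{ $_[0]->{referrals} } }

# Hypothetical People object that can be used wherever a dblink is expected:
# email is exposed as primary_id and name as optional_id.
package My::Person;

sub new {
    my ($class, %args) = @_;
    return bless { email => $args{email}, name => $args{name} }, $class;
}
sub primary_id  { return $_[0]->{email} }
sub optional_id { return $_[0]->{name} }

package main;

my $curator  = My::Person->new(email => 'curator@example.org', name => 'A. Curator');
my $evidence = My::Evidence->new(type => 'EVI_CURATOR', referrals => [$curator]);
my $default  = My::Evidence->new;   # no type given, falls back to EVI_UNKNOWN

for my $link ($evidence->referrals) {
    printf "%s: %s (%s)\n", $evidence->type, $link->primary_id, $link->optional_id;
}
```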

Now - I think we could argue that

    Bio::Seq and Bio::SeqFeature::Gene should *be* Bio::AnnotationContainerI's
not *have-a* annotation. This is closer to both OMG and BioJava things, which
I think is nice.

This scheme is *very* extensible, without burdening annotation containers
with having to deal with every possible annotation type. 

For the current objects, obviously Bio::Annotation becomes a
Bio::AnnotationContainerI and References, Comments and DBLink
implement Bio::Annotation::BaseI. In most cases their evidence will be
unknown when they are pulled from flat files (unless we can figure out
the flat file representation of this.... this is not so far off for
swissprot). Comments will have a hard coded "Comment" description line
and "comment" type. References will have a description like
"Reference: $title" truncated at 80 chars. DBLinks will have a deduced
description line from their primary_id and database.
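
A minimal sketch of the Comment and Reference behaviour described above,
assuming hypothetical class names and constructors. Returning undef from
evidence is only a placeholder for "unknown" evidence, and cutting at 79
characters is one reading of "truncated at 80 chars" that also keeps the
description under 80.

```perl
use strict;
use warnings;

package My::Annotation::Comment;     # hypothetical class name

sub new {
    my ($class, $text) = @_;
    return bless { text => $text }, $class;
}
sub text              { return $_[0]->{text} }
sub evidence          { return undef }      # placeholder for "unknown" evidence
sub resolve_to_Object { return undef }      # the proposal allows undef
sub description       { return 'Comment' }  # hard coded
sub type              { return 'comment' }  # hard coded

package My::Annotation::Reference;   # hypothetical class name

sub new {
    my ($class, $title) = @_;
    return bless { title => $title }, $class;
}
sub evidence          { return undef }
sub resolve_to_Object { return undef }
sub type              { return 'reference' }

sub description {
    my ($self) = @_;
    my $desc = "Reference: $self->{title}";
    $desc =~ s/\s+/ /g;             # fold line breaks into single spaces
    return substr($desc, 0, 79);    # stay under 80 characters
}

package main;

# A deliberately long, multi-line title to exercise the truncation.
my $title = join "\n", ('An illustrative and deliberately long reference title') x 3;
my $comment   = My::Annotation::Comment->new('free text from a flat file');
my $reference = My::Annotation::Reference->new($title);

for my $annotation ($comment, $reference) {
    printf "%-10s %s\n", $annotation->type, $annotation->description;
}
```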

For Mark's problem, GO annotation, I would propose a new Bio::Annotation::BaseI
implementing class, called Bio::ControlledVocab::Word. Description will actually
be the word; type would be set centrally from Bio::ControlledVocab, which would
contain all the words.
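
Roughly, and only as a sketch: the vocabulary object hands its type down
to each word it creates, so a GO word and a tissue word are the same
Perl class but report different annotation types. The My:: class names,
the word and all_words methods, and the constructor arguments are all
hypothetical.

```perl
use strict;
use warnings;

# Hypothetical stand-in for Bio::ControlledVocab::Word.
package My::ControlledVocab::Word;

sub new {
    my ($class, %args) = @_;
    return bless { word => $args{word}, type => $args{type} }, $class;
}
sub description       { return $_[0]->{word} }   # the description is the word itself
sub type              { return $_[0]->{type} }   # handed down by the vocabulary
sub evidence          { return undef }           # placeholder for "unknown"
sub resolve_to_Object { return undef }

# Hypothetical stand-in for Bio::ControlledVocab.
package My::ControlledVocab;

sub new {
    my ($class, %args) = @_;
    return bless { type => $args{type}, words => {} }, $class;
}

# Returns the single Word object for a term, creating it on first use.
sub word {
    my ($self, $text) = @_;
    $self->{words}{$text} ||= My::ControlledVocab::Word->new(
        word => $text,
        type => $self->{type},
    );
    return $self->{words}{$text};
}

sub all_words {
    my ($self) = @_;
    my @words = sort keys %{ $self->{words} };
    return @words;
}

package main;

my $go     = My::ControlledVocab->new(type => 'GO');
my $tissue = My::ControlledVocab->new(type => 'tissue');
my $kinase = $go->word('protein kinase activity');
my $liver  = $tissue->word('liver');

for my $word ($kinase, $liver) {
    printf "%s: %s / %s\n", ref($word), $word->type, $word->description;
}
```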

Or... you get the idea...

Plus points -
          no insistence on implementations implementing everything

          ... therefore extensible

          the interface Bio::Annotation::BaseI does actually define a
minimum set of "useful methods"

          No nasty $annotation->isa('....') calls

          will play well with GenBank/EMBL/Swissprot flat file formats

Minus points -

          Not strongly typed. ie, you cannot be sure that

          @dblinks = $annotation->get_all_Annotation_by_type('dblink');

will actually give you something. Also a problem that you don't know
whether an empty array means "not implemented" or "implemented but this
instance doesn't have any".
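
The ambiguity can be shown with two made-up containers, one that has no
notion of dblinks at all and one that stores them but happens to be
empty. Both class names are hypothetical, and this is only a sketch of
the problem, not a fix for it.

```perl
use strict;
use warnings;

# Hypothetical container that does not implement dblink storage at all.
package My::NoDBLinkContainer;

sub new { my ($class) = @_; return bless {}, $class }
sub get_all_Annotation_by_type { return () }

# Hypothetical container that stores dblinks, but has none in this instance.
package My::EmptyDBLinkContainer;

sub new { my ($class) = @_; return bless { dblink => [] }, $class }
sub get_all_Annotation_by_type {
    my ($self, $type) = @_;
    return @{ $self->{$type} || [] };
}

package main;

my @counts;
for my $container (My::NoDBLinkContainer->new, My::EmptyDBLinkContainer->new) {
    my @dblinks = $container->get_all_Annotation_by_type('dblink');
    push @counts, scalar(@dblinks);
    printf "%s: %d dblinks\n", ref($container), scalar(@dblinks);
}
```

From the caller's side the two results are identical, which is exactly
the weakness noted above.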

          Interface definitions perhaps don't buy you much (so we can
display them. Big deal!). For it to be useful there has to be a
limited set of "used" annotation objects in different container
objects. However there is so little commonality about what people want
to store...

What do people think?


Ewan Birney. Mobile: +44 (0)7970 151230, Work: +44 1223 494420