[Bioperl-l] Re: Annotation structure

Thomas Down td2@sanger.ac.uk
Fri, 3 Aug 2001 15:51:22 +0100

On Thu, Aug 02, 2001 at 10:57:14PM +0100, Ewan Birney wrote:
> As mentioned at BOSC, I want to overhaul the annotation
> structure. Currently we have the rather crappy 
>   $seq->annotation(); # gives an annotation object
>   $annotation->each_Reference(); # list of references
>   $annotation->each_Comment();   # list of comments
>   $annotation->each_DBLink();    # list of dblinks
> This is very much "what you need to store for round tripping
> genbank-embl". It is not focused at all on extensibility.
> The proposal is to head towards more of a generic tag => list of values
> scheme which will (a) extend better and (b) play much better with biojava
> and biocorba. My current proposal is this:
> Bio::Annotation moves to Bio::AnnotationCollection. 
> Bio::AnnotableI (direct copy from biojava) defines the method
>    $obj->annotation(); 
> which gives back a Bio::AnnotationCollection

Looks good so far...

> (a) I always feel we have one class too many here - I sort of want to
> remove AnnotableI and make Seq inherit from AnnotationCollectionI. But
> this is the way biojava does it (which may well be due to how we did it in
> the first place). This relates to (c) below.

I can't really give a BioPerl perspective here, but certainly
in BioJava I still believe the Annotatable/Annotation division
is a good idea, since it allows you to decouple (when appropriate)
the annotatable objects themselves (features, sequences, whatever)
from the Annotations (which are generally just storage objects).

[This can, of course, also be done with delegation.  Swings
and roundabouts].

It's also quite nice, when hacking up a quick implementation
of an Annotatable interface which doesn't have anything
special associated, to do:

  public Annotation getAnnotation() {
      return Annotation.EMPTY_ANNOTATION;
  }

Another trick I sometimes use to keep the costs of using
Annotations (which are generally backed by Java HashMaps)
as low as possible when you're working with thousands, if
not millions, of objects, is:

  public Annotation getAnnotation() {
      Annotation temp = new SimpleAnnotation();
      temp.setProperty("the.one.thing.I.want.to.expose", blah);
      return temp;
  }

(This pattern is used in a few places in biojava-ensembl).

> (b) We've got to have some additional set of "standard" keys, like
>    reference, dblink, comment
> etc to agree on. That's ok - that's what you live with for extensibility,
> but there is an argument that you might want something more hierarchical
> such that
>     @objects = $ann->get_Annotation("geneticdisease")
> would give you back Bio::Something::Disease::Genetic but
>     @objects = $ann->get_Annotation("disease")
> gives back the superset. Some hierarchical type system (centrally
> controlled?) controls the standard. (Good? Bad?)
> After thinking about this I don't like it - it is asking for quite a heavy
> system behind the scenes (not so heavy, but heavy enough) to manage this
> and will make implementing other objects behind this interface
> tough. Hmph.
> In general, if we do set a standard set of tags, to what extent should we
> enforce the tag-->object mapping? I'm leaning towards relatively strictly
> enforcing it with a hash in AnnotationCollectionI being something like
> %tag_object_map = (
> 	'reference' => 'Bio::Annotation::ReferenceI',
> 	'dblink'    => 'Bio::Annotation::DBLinkI',
> 	'comment'   => 'Bio::Annotation::CommentI' );
> with the idea that implementations enforce these rules on their annotation
> collections.

That mapping you give there looks to me like /exactly/ the sort
of thing which you can describe in a language like DAML+OIL
(or even RDF Schema, on which DAML is based).  I'd rather
like to see this happen in BioJava, too.
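
As a rough illustration of the enforcement idea in Java terms, here is a
minimal sketch. All of the names (TypedAnnotationCollection, Reference,
DBLink, Comment) are hypothetical stand-ins made up for this example; they
are not existing BioJava or BioPerl classes. Tags in the standard map are
type-checked on insertion, and any other tag is accepted as-is.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical marker interfaces standing in for the Perl
// Bio::Annotation::ReferenceI / DBLinkI / CommentI types.
interface Reference {}
interface DBLink {}
interface Comment {}

// Hypothetical collection that enforces a tag -> type mapping.
class TypedAnnotationCollection {
    private final Map<String, Class<?>> tagTypes = new HashMap<>();
    private final Map<String, List<Object>> values = new HashMap<>();

    TypedAnnotationCollection() {
        // The "standard" tags and the types they must hold.
        tagTypes.put("reference", Reference.class);
        tagTypes.put("dblink", DBLink.class);
        tagTypes.put("comment", Comment.class);
    }

    void add(String tag, Object value) {
        Class<?> required = tagTypes.get(tag);
        // Standard tags are checked; non-standard tags are unrestricted.
        if (required != null && !required.isInstance(value)) {
            throw new IllegalArgumentException(
                "Tag '" + tag + "' requires a " + required.getName());
        }
        values.computeIfAbsent(tag, k -> new ArrayList<>()).add(value);
    }

    List<Object> get(String tag) {
        return values.getOrDefault(tag, Collections.emptyList());
    }
}
```

The point of checking at insertion time is that a mistyped value fails
where it is added, rather than surfacing later as a confusing bug in
whatever code reads the annotation.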

(Incidentally, DAML+OIL has a concept of `subproperties', which
seems to fit what you're describing in the disease/geneticdisease
case rather well...  As you say, there's a little bit of extra
complexity here.  On the other hand, it's perfectly valid to have
an AnnotationCollectionI implementation which /doesn't/ support
sub-properties.  And you may well be able to get away with only
implementing the sub-property support code once).
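
To make the sub-property idea concrete, here is a small sketch of a lookup
that returns values stored under a tag or under any tag declared as its
sub-property. The class and method names (SubPropertyAnnotation,
declareSubProperty) are invented for illustration, and the parent map is a
plain in-memory table rather than anything read from DAML+OIL.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical annotation store with a simple tag hierarchy.
class SubPropertyAnnotation {
    // child tag -> parent tag, e.g. "geneticdisease" -> "disease"
    private final Map<String, String> parentOf = new HashMap<>();
    private final Map<String, List<Object>> values = new HashMap<>();

    void declareSubProperty(String child, String parent) {
        parentOf.put(child, parent);
    }

    void add(String tag, Object value) {
        values.computeIfAbsent(tag, k -> new ArrayList<>()).add(value);
    }

    // Returns values stored under the tag itself or any of its sub-properties.
    List<Object> get(String tag) {
        List<Object> result = new ArrayList<>();
        for (Map.Entry<String, List<Object>> e : values.entrySet()) {
            if (isSameOrSubProperty(e.getKey(), tag)) {
                result.addAll(e.getValue());
            }
        }
        return result;
    }

    // Walks up the parent chain; the 'seen' set guards against cycles.
    private boolean isSameOrSubProperty(String tag, String ancestor) {
        Set<String> seen = new HashSet<>();
        for (String t = tag; t != null && seen.add(t); t = parentOf.get(t)) {
            if (t.equals(ancestor)) {
                return true;
            }
        }
        return false;
    }
}
```

An implementation with no declared sub-properties behaves exactly like a
flat collection, which is why support for the hierarchy can stay optional.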

> (c) Biojava and Biocorba reuse the annotation interface for the tag-value
> qualifiers off features and (therefore) have the same extensibility of
> their annotations. I've always been against this because it seems to have
> to store what is very often strings, but I think I have been a bit of a
> luddite here: the killer use case is a gene seqfeature which should have
> as rich an annotation - and as extensible - as a sequence.
> The problem here is that I want to keep backward compatibility with the
> current has_tag_value, each_tag_value system on SeqFeatureI, reusing the
> AnnotationI ->string method to allow these to be put in. This means I want
>   SeqFeatureI to inherit from AnnotationCollectionI
> This is different from biojava, which I believe has its SeqFeatureI equivalent
> inheriting from Annotable and so having a separate annotation call to the
> annotationcollection object. To make sure seqfeatures can maintain the old
> has_tag_value etc there would be some somewhat ugly delegation (I guess
> not so bad) out to this annotation object.
> This will make SeqFeature::Generic much heavier if we have to build a
> Bio::AnnotationCollection object for each SeqFeature::Generic and this is
> bad news as we make millions of SeqFeature::Generic's...
> So - question for Matt/Thomas - why do you split out AnnotableI and
> AnnotationCollectionI in biojava - what is the win?

The answer to this question might not be the same for Java
and Perl.  However, with my BioJava hat on:

It's a nice thing to decouple, and it encourages re-use of
implementation.  Especially since BioJava re-uses Annotatable
in quite a few places -- not just Sequence and Feature.

In Java, the per-object overhead is really pretty low.  Of
course, once your objects are wrappers around HashMaps, then
you're up to an overhead approximately equal to Perl (obviously),
but we use various tricks to work around this:

  - We've got a SmallAnnotation implementation, which uses a linear
    look-up table rather than a hash-table.  Big memory saving
    when there's lots of them, and it's actually as fast as a
    hashtable for up to 10-20 keys (which covers most common
    use cases).

  - Flyweighted EMPTY_ANNOTATION (which is also quite common
    for some things).

  - Many custom feature implementations construct their Annotation
    on the fly rather than storing them.
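
The linear look-up idea in the first point can be sketched roughly as
below. This is only an illustration of the technique, not the actual
SmallAnnotation source; the class name LinearAnnotation and its methods
are made up for the example.

```java
import java.util.Arrays;

// Hypothetical annotation backed by one flat array instead of a hash table.
// Layout: key0, value0, key1, value1, ...
class LinearAnnotation {
    private Object[] slots = new Object[0];

    // Linear scan over the keys; cheap when there are only a few entries.
    Object getProperty(String key) {
        for (int i = 0; i < slots.length; i += 2) {
            if (slots[i].equals(key)) {
                return slots[i + 1];
            }
        }
        return null;
    }

    void setProperty(String key, Object value) {
        // Overwrite in place if the key is already present.
        for (int i = 0; i < slots.length; i += 2) {
            if (slots[i].equals(key)) {
                slots[i + 1] = value;
                return;
            }
        }
        // Otherwise grow the array by exactly one key/value pair.
        slots = Arrays.copyOf(slots, slots.length + 2);
        slots[slots.length - 2] = key;
        slots[slots.length - 1] = value;
    }

    int size() {
        return slots.length / 2;
    }
}
```

Each instance carries a single array sized to its contents, with no
buckets or entry objects, which is where the memory saving comes from.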

Of course, if you don't want the extra object, there's no
reason why you can't have an Annotatable object which itself
implements your AnnotationCollectionI interface, is there?
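
A sketch of that arrangement, with the two interfaces reduced to the bare
minimum. The interface and class names here (AnnotationCollection,
Annotatable, SelfAnnotatedFeature) are assumptions for the example and do
not reproduce the real BioJava or BioPerl definitions.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, cut-down versions of the two interfaces under discussion.
interface AnnotationCollection {
    Object getProperty(String key);
    void setProperty(String key, Object value);
}

interface Annotatable {
    AnnotationCollection getAnnotation();
}

// A feature that is its own annotation collection: no second object
// is allocated, but callers can still go through getAnnotation().
class SelfAnnotatedFeature implements Annotatable, AnnotationCollection {
    private final Map<String, Object> props = new HashMap<>();

    @Override
    public AnnotationCollection getAnnotation() {
        return this;
    }

    @Override
    public Object getProperty(String key) {
        return props.get(key);
    }

    @Override
    public void setProperty(String key, Object value) {
        props.put(key, value);
    }
}
```

Code written against Annotatable keeps working unchanged, and an
implementation that prefers a separate collection object can return one
from getAnnotation() instead.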

> Do people have opinions on
> this? Jason/Hilmar/Heikki/Matt/Thomas/Mark+David are the people I am most
> interested in hearing from. Key questions:
>   (a) rigid biojava/biocorba cribbing, or removing this AnnotableI
> interface? (I favour removing)

I don't think this is such a big decision, really.  If you do
remove the interface, you can always have implementations which delegate
to an external AnnotationCollection.  Correspondingly, you can
keep the interface, but write a generic SeqFeature implementation
which manages its Annotations itself, rather than passing back
another object.

The latter feels a little cleaner to me, but that could just
be what I'm used to.

>   (b) type enforcement of standard types (I like enforcement - it will
> catch otherwise weird looking bugs)

Yes yes yes.  We'll be watching this with a /lot/ of interest
from the BioJava side, if you go forward with this.

Can I point out the DAML+OIL stuff again, as a rather nice
way of specifying what properties you consider `standard', and
what types you expect to see?


>   (c) type hierarchy or flat (I favour flat)

Hierarchy is definitely more powerful, but realistically I guess
that flat covers the bulk of the use cases for now.  BioJava is
completely flat at the moment, and we've managed alright so far.
But this might change in the future (BioJava 2?).