[Bioperl-l] New Annotation interfaces! Mark/David/whoever - check it out!

Ewan Birney birney@ebi.ac.uk
Tue, 30 Oct 2001 21:00:56 +0000 (GMT)

Well - two hours of British Rail has its benefits. I have committed the
new Annotation framework. This is a pretty major set of changes! The 
exciting thing is that this is **definitely** the right way to go.

The most important points here are that the framework

  (a) is extensible

  (b) plays well with XML/data-orientated approaches.

I've included in this message the two main interfaces -
Bio::AnnotationCollectionI and Bio::AnnotationI. I have implemented
these in Bio::Annotation::Collection and adapted the existing
Bio::Annotation::* classes to work with them.

Then I added a backward-compatibility harness to
Bio::Annotation::Collection for the (I guess) 0.7.* API (it calls
deprecated for each old function, so you will know when you hit one).

Then I adapted the genbank/embl/swiss SeqIO systems to work - was easy
(tick for the design I think) and t/SeqIO.t passed without any additional
work (wow!)

Things on my TODO list 

  - put in Controlled Vocab somehow - need to talk to the GO folks 
to check I do it right

  - decide how to deal with updates, as I am sure GenQuire will want to
write back here (right guys??)

  - check XML outputting

One exploratory item is

  - generic XML registration for generic XML stream -> Annotation objects

This has been a Bioperl production brought to you by British Rail...


=head1 NAME

Bio::AnnotationCollectionI - Interface for annotation collections


   # get an AnnotationCollectionI somehow, eg

   $ac = $seq->annotation();

   foreach my $key ( $ac->get_all_annotation_keys() ) {
       my @values = $ac->get_Annotations($key);
       foreach my $value ( @values ) {
          # value is a Bio::AnnotationI, and defines an "as_text" method
          print "Annotation ",$key," stringified value ",$value->as_text,"\n";
          # it also defines a hash_tree method, which allows data-orientated
          # access into this object
          my $hash = $value->hash_tree();
       }
   }


Annotation Collections are a way of storing a series of "interesting
facts" about something. We call an "interesting fact" in Bioperl an
Annotation (this differs from a Sequence Feature, which is called
a Sequence Feature and may or may not have an Annotation Collection).

The trouble is that we cannot be sure what "interesting facts"
someone might want to store: the possibilities are endless.

Bioperl's approach is that the "interesting facts" are represented by
Bio::AnnotationI objects. The interface Bio::AnnotationI guarantees
two methods

   $obj->as_text(); # string formatted to display to users

   $obj->hash_tree(); # hash with defined rules for data-orientated discovery

The hash_tree method is designed to play well with XML output and
other "nested-tag-of-data-values" formats (think BoulderIO and/or Ace).
For more information, read the Bio::AnnotationI docs.
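
As a rough illustration of those two methods, here is a minimal sketch of an
annotation class. The package My::GeneName and its fields are hypothetical and
not part of Bioperl; the only names taken from the interface are as_text and
hash_tree.

```perl
use strict;
use warnings;

# Hypothetical annotation class, not part of Bioperl: it only illustrates
# the two methods that the Bio::AnnotationI interface asks for.
package My::GeneName;

sub new {
    my ($class, %args) = @_;
    my $self = {
        name     => $args{name},
        synonyms => $args{synonyms} || [],
    };
    return bless $self, $class;
}

# One-line, human-readable summary of the annotation.
sub as_text {
    my ($self) = @_;
    return "Gene name: $self->{name}";
}

# Plain nested data: scalars, array references and hash references only.
sub hash_tree {
    my ($self) = @_;
    return {
        name    => $self->{name},
        synonym => [ @{ $self->{synonyms} } ],
    };
}

package main;

my $gene = My::GeneName->new( name => 'BRCA2', synonyms => ['FANCD1'] );
print $gene->as_text, "\n";
```

The point of the split is that as_text is for people, while hash_tree hands
back plain data that a generic writer can walk without knowing the class.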

Annotations are stored in AnnotationCollections, each Annotation under a
different "tag". The tags allow simple discovery of the available annotations,
and in some cases (like the tag "gene_name") indicate how to interpret the
data underneath the tag. Tags are only one level deep, and each tag can
hold an array of values.

In addition, AnnotationCollectionIs are guaranteed to maintain a consistent
set of object values under each tag, at least to the extent that every object
under a tag complies with one interface. The "standard" AnnotationCollection
insists on the following rules

  Tag         Object
  ---         ------
  reference   Bio::Annotation::Reference
  comment     Bio::Annotation::Comment
  dblink      Bio::Annotation::DBLink
  gene_name   Bio::Annotation::SimpleValue
  description Bio::Annotation::SimpleValue

These tags are the implicit tags that the SeqIO system needs to round-trip
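
One way to picture the tag-to-class rule is the sketch below. It is not the
Bio::Annotation::Collection code: My::Comment, My::TypedCollection, the
%TAG_CLASS map and the add_Annotation method name are all assumptions made for
illustration. Only get_all_annotation_keys and get_Annotations are names taken
from the interface described here.

```perl
use strict;
use warnings;

# Hypothetical stand-in for an annotation class, not a Bioperl class.
package My::Comment;

sub new     { my ($class, $text) = @_; return bless { text => $text }, $class }
sub as_text { return "Comment: $_[0]{text}" }

# Hypothetical collection that enforces one class per tag.
package My::TypedCollection;

use Scalar::Util qw(blessed);

# Tag => class that every value stored under that tag must be an instance of.
my %TAG_CLASS = ( comment => 'My::Comment' );

sub new {
    my ($class) = @_;
    return bless { store => {} }, $class;
}

# Assumed method name: store an object under a tag, checking its class first.
sub add_Annotation {
    my ($self, $tag, $obj) = @_;
    my $want = $TAG_CLASS{$tag};
    if ( defined $want && !( blessed($obj) && $obj->isa($want) ) ) {
        die "Tag '$tag' only accepts $want objects\n";
    }
    push @{ $self->{store}{$tag} }, $obj;
    return;
}

sub get_all_annotation_keys { return sort keys %{ $_[0]{store} } }
sub get_Annotations         { return @{ $_[0]{store}{ $_[1] } || [] } }

package main;

my $ac = My::TypedCollection->new;
$ac->add_Annotation( comment => My::Comment->new('first') );
$ac->add_Annotation( comment => My::Comment->new('second') );

# A plain string under a typed tag is rejected.
my $ok = eval { $ac->add_Annotation( comment => 'just a string' ); 1 };
print "rejected: $@" unless $ok;
```

Tags without an entry in the map are left unchecked in this sketch, which is
one possible way to leave room for new tags.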

However, you as a user and us collectively as a community can grow the
"standard" tag mapping over time and specifically for a particular

=head1 NAME

Bio::AnnotationI - Annotation interface


   # generally you get AnnotationIs from AnnotationCollectionIs

   foreach my $key ( $ac->get_all_annotation_keys() ) {
       my @values = $ac->get_Annotations($key);
       foreach my $value ( @values ) {
          # value is a Bio::AnnotationI, and defines an "as_text" method
          print "Annotation ",$key," stringified value ",$value->as_text,"\n";
          # you can also use a generic hash_tree method for getting
          # stuff out, say into XML format
          my $hash_tree = $value->hash_tree();
       }
   }


This is the interface that all annotations must support. There are two things
that each annotation has to provide.


Annotations have to support an "as_text" method. This should return a
single text string, without newlines, representing the annotation,
mainly for human readability. It is not intended as a way to
store or reconstruct the annotation.

The second method allows annotations to at least attempt to represent
themselves as pure data for storage/display/whatever. The method

   $hash = $annotation->hash_tree();

should return an anonymous hash with "XML-like" formatting. The
formatting is as follows.

  (1) For each key in the hash, if the value is a reference to an array -

      (2) For each element of the array, if the value is an object -
          Assume the object has the method "hash_tree";
      (3) else if the value is a reference to a hash
          Recurse again from point (1)
      (4) else
          Assume the value is a scalar, and handle it directly as text

  (5) else (if not an array) apply rules 2, 3 and 4 to the value

The XML path of tags is given by the keys followed through the
hashes. When an array is encountered, all of its elements sit at the
path level of that tag.

This is a pretty "natural" representation of an object tree in an XML
style, without forcing everything to inherit from some super-generic
interface for representing things in the hash.
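
The walking rules can be sketched as a small recursive writer. This is only an
illustration under stated assumptions: tree_to_xml, value_to_xml and escape_xml
are hypothetical helper names, keys are sorted just to make the output
repeatable, and no attempt is made to check that keys are legal XML tag names.

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed);

# Escape the characters that would otherwise break the XML text.
sub escape_xml {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Render one value under a tag, following rules (2) to (4).
sub value_to_xml {
    my ($tag, $value) = @_;
    my $inner;
    if ( blessed($value) ) {
        # rule (2): an object, so ask it for its own hash_tree
        $inner = tree_to_xml( $value->hash_tree );
    }
    elsif ( ref $value eq 'HASH' ) {
        # rule (3): a nested hash, so recurse from rule (1)
        $inner = tree_to_xml($value);
    }
    else {
        # rule (4): a plain scalar, emitted as text
        $inner = escape_xml($value);
    }
    return "<$tag>$inner</$tag>";
}

# Walk one hash level: rule (1) for arrays, rule (5) for single values.
sub tree_to_xml {
    my ($tree) = @_;
    my $xml = '';
    for my $key ( sort keys %$tree ) {
        my $value = $tree->{$key};
        # Array elements all sit at this tag's level, one element per tag.
        my @items = ref $value eq 'ARRAY' ? @$value : ($value);
        $xml .= value_to_xml( $key, $_ ) for @items;
    }
    return $xml;
}

my $tree = {
    title   => 'A paper',
    author  => [ 'Smith', 'Jones' ],
    journal => { name => 'J. Example', year => 2001 },
};
print tree_to_xml($tree), "\n";
```

Each element of the author array gets its own author tag, which is one reading
of the statement that array elements are all present at the path level of the
tag.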