[Bioperl-l] Structured (nested) Annotation
Hilmar Lapp
hlapp@gnf.org
Sun, 6 Oct 2002 23:21:22 -0700
I wanted to be able to represent structured, nested annotation (we
were there a while ago weren't we?). I made two changes/additions
that can accomplish this, but in different ways.
1) Annotation collections can now be nested, because
Annotation::Collection now implements AnnotationI.
I added two methods dealing specifically with nested annotation:
get_all_Annotations() is similar to get_Annotations(), but traverses
the whole tree of nested collections (if there is no nesting, it
behaves identically to get_Annotations()). flatten_Annotations()
turns a nested collection into an un-nested one.
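
To illustrate the difference between the three calls, here is a
minimal, language-neutral sketch in Python. It is not the BioPerl
code: the class and method names (Collection, SimpleValue,
get_all_annotations, flatten) are hypothetical stand-ins, and it
assumes that the key is matched at every nesting level, which may
differ from what the real implementation does.

```python
class SimpleValue:
    """A leaf annotation holding one value (hypothetical stand-in)."""

    def __init__(self, value):
        self.value = value


class Collection:
    """Holds annotations by key; may itself be stored as an annotation."""

    def __init__(self):
        self._by_key = {}

    def add_annotation(self, key, ann):
        self._by_key.setdefault(key, []).append(ann)

    def get_annotations(self, key):
        # Direct children only; nested collections are not descended into.
        return list(self._by_key.get(key, []))

    def get_all_annotations(self, key):
        # Depth-first walk: collect matching leaves here, descend into
        # every nested collection (assumption: key applies at all levels).
        found = []
        for k, anns in self._by_key.items():
            for ann in anns:
                if isinstance(ann, Collection):
                    found.extend(ann.get_all_annotations(key))
                elif k == key:
                    found.append(ann)
        return found

    def flatten(self):
        # Hoist the leaves of nested collections up to this level,
        # keeping the keys they had inside the nested collection.
        flat = {}
        for k, anns in self._by_key.items():
            for ann in anns:
                if isinstance(ann, Collection):
                    ann.flatten()
                    for nested_key, nested in ann._by_key.items():
                        flat.setdefault(nested_key, []).extend(nested)
                else:
                    flat.setdefault(k, []).append(ann)
        self._by_key = flat


inner = Collection()
inner.add_annotation("comment", SimpleValue("nested"))
outer = Collection()
outer.add_annotation("comment", SimpleValue("top"))
outer.add_annotation("group", inner)
```

With this setup, get_annotations("comment") on outer sees only the
top-level value, get_all_annotations("comment") sees both, and after
outer.flatten() the two calls agree.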
I thought it may be a good idea to promote get_all_Annotations() to
the interface (AnnotationCollectionI). What do people think?
Ewan/Jason?
2) Nesting entirely through objects is somewhat heavyweight if all
you want to nest is simple values. So I added
Annotation::StructuredValue which inherits from
Annotation::SimpleValue and can be called as if it were a simple
value. In addition, there are methods to add simple values in a
structured, nested way, and control the way the structure is
flattened into a single string. Also, there is get_all_values()
returning a flattened array of values.
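
As a rough illustration of the idea (again a Python sketch, not the
actual Annotation::StructuredValue code), a structured value can be
thought of as a nested list of strings. The parameter names joins and
brackets below are assumptions made for this example: one join string
per nesting level, and a pair of brackets wrapped around each nested
group when flattening to a string.

```python
class StructuredValue:
    """Nested list of simple string values (hypothetical stand-in)."""

    def __init__(self, values=None):
        self.values = values if values is not None else []

    def get_all_values(self):
        # Flatten the nested structure into one list, depth-first.
        out = []

        def walk(node):
            for item in node:
                if isinstance(item, list):
                    walk(item)
                else:
                    out.append(item)

        walk(self.values)
        return out

    def value(self, joins=(" AND ", " OR "), brackets=("(", ")")):
        # Flatten into a single string; the join string is chosen by
        # depth, reusing the last one if the nesting goes deeper.
        def render(node, depth):
            join = joins[min(depth, len(joins) - 1)]
            parts = []
            for item in node:
                if isinstance(item, list):
                    inner = render(item, depth + 1)
                    parts.append(brackets[0] + inner + brackets[1])
                else:
                    parts.append(item)
            return join.join(parts)

        return render(self.values, 0)


sv = StructuredValue([["CALM1", "CAM1"], ["CALM2", "CAM2"]])
print(sv.value())           # single flattened string
print(sv.get_all_values())  # flat list of all names
```

Passing different joins or empty brackets changes only the string
form; the nested structure and get_all_values() are unaffected.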
My starting use case was to somehow retain the structured
information in swissprot GN lines.
In case you're unfamiliar with swissprot, GN gives the names of the
genes that give rise to the protein sequence of the entry. Different
genes are concatenated by ' AND ', whereas synonyms of the same gene
are concatenated by ' OR '. Both may co-occur in the same GN line,
in which case parentheses are used to group. An infamous example is
Calmodulin (many many species for this entry ...):
GN (CALM1 OR CAM1 OR CALM OR CAM) AND (CALM2 OR CAM2 OR CAMB) AND
GN (CALM3 OR CAM3 OR CAMC).
The parser screwed this up before because it didn't know about the
AND join operator, possible nesting, or multiple GN lines. I
fixed all this and changed the parser to now construct an
Annotation::StructuredValue object for this.
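
For readers who want to see the grouping spelled out, the following
Python sketch shows one way such GN lines could be turned into a
nested list (genes on the outer level, synonyms on the inner level).
It is not the parser code that was committed; parse_gn is a
hypothetical helper, and it assumes at most one level of parentheses
and that no gene name itself contains ' AND ' or ' OR '.

```python
def parse_gn(lines):
    """Return a list of genes, each a list of synonymous names."""
    # Drop the 'GN' line code and join continuation lines into one string.
    text = " ".join(line[2:].strip() for line in lines
                    if line.startswith("GN"))
    # The GN field ends with a period.
    text = text.rstrip(".").strip()
    if not text:
        return []
    genes = []
    for group in text.split(" AND "):
        group = group.strip()
        # Parentheses only group synonyms; strip them if present.
        if group.startswith("(") and group.endswith(")"):
            group = group[1:-1]
        genes.append([name.strip() for name in group.split(" OR ")])
    return genes


calmodulin = [
    "GN (CALM1 OR CAM1 OR CALM OR CAM) AND (CALM2 OR CAM2 OR CAMB) AND",
    "GN (CALM3 OR CAM3 OR CAMC).",
]
genes = parse_gn(calmodulin)
```

For the Calmodulin example this yields three genes with four, three,
and three names respectively.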
NOTE: The consequence is that now you get back only _ONE_ object for
$seq->annotation->get_Annotations("gene_name"). You need to call
get_all_values() on that object to see all the names. Calling
value() will return the structured array flattened into a single
string.
-hilmar
--
-------------------------------------------------------------
Hilmar Lapp email: lapp at gnf.org
GNF, San Diego, Ca. 92121 phone: +1-858-812-1757
-------------------------------------------------------------