[Biojava-l] ResidueList & Annotatable

Matthew Pocock mrp@sanger.ac.uk
Wed, 26 Jan 2000 11:46:26 +0000


Hi

Sorry if I say things that have already been said. I did most of the
design of the bio.seq and bio.alignment packages, and quite a large slice
of the implementation. bio.seq was written after working with bioperl, and
took about two weeks to get sort of stable. We have hardly changed
anything since - except making some of the interfaces more beaney. Anyway,
I guess I am trying to say that it feels like my baby.

The aim wasn't to build a comp-sci tool kit of unusably elegant partial
solutions. We use these objects daily, and so they have to work well or
get edited. We don't like reading or remembering very much, so we use a few
interfaces with all the documentation in them, and multiple implementations
that can be chosen using their javadoc headline, or even chosen
automatically by a factory object. We chose minimalist interfaces so that
we could implement an idea in multiple ways without breaking the client
code, and we use delegation/composition much more than implementation
inheritance, as this lets us plug-and-play behaviour in a very organic and
natural way. Some things that could be implemented by inheritance are
implemented by parameterisation (e.g. proteins and DNA use the same types
of objects - just a different alphabet of Residues). This captures the
essence of what we are modeling much more cleanly than rampant class
creation.

Anyway, enough of the design jihad.

Ann Loraine wrote:

> Hi,
>
> My 2 cents on the discussion of the Sequence interface as a
> subinterface of ResidueList -

ResidueList is what bioperl thinks of as a light-weight sequence. It
enforces alphabet type-checking, and has methods to extract bits of it.
Sub-residue lists are made using a factory method, and can be efficiently
implemented over Java's List.subList(start, end) method, which doesn't do
a memory copy - so we can load in Chr22 as a single residue list, and then
create ResidueList objects for every conceivable bit of it, and there is
just the per-object overhead of storing the offsets (which is all handled
in the case of SimpleResidueList by the sub-list implementation in
ArrayList). ResidueList has an Alphabet of the Residues that are allowed
to appear within it. This alphabet is used for implementing the fly-weight
design pattern, which means that we cleanly use multiple references to
shared objects, not multiple objects.
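
To illustrate the no-copy point, here is a minimal sketch using only
java.util. It is not the bio.seq code: the class name SubListViewSketch and
the use of Character as a stand-in for Residue are assumptions made for the
example. Note that List.subList takes a 0-based, half-open range.

```java
import java.util.ArrayList;
import java.util.List;

class SubListViewSketch {
    /** Builds a mutable list of residue symbols from a string. */
    static List<Character> load(String residues) {
        List<Character> list = new ArrayList<>(residues.length());
        for (char c : residues.toCharArray()) {
            list.add(c);
        }
        return list;
    }

    public static void main(String[] args) {
        List<Character> parent = load("ACGTACGTAC");

        // subList returns a view backed by the parent list, not a copy.
        List<Character> region = parent.subList(2, 6);
        System.out.println(region);        // [G, T, A, C]

        // A non-structural change to the parent shows through the view.
        parent.set(3, 'N');
        System.out.println(region.get(1)); // N
    }
}
```

The view stays valid only while the parent is not structurally modified
(no adds or removes), which fits a sequence that is loaded once and then
read.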

Sequence adds the concept of ID, URN (which we hope to explore properly
within the next 6 months), and features. It also is annotatable (like many
bio.seq objects), which means that you can attach arbitrary key/value
information to the whole sequence.

Features let you flag a region of the sequence and are annotatable. Their
locations can be simple - a range, or complex - individual residues
scattered over the whole chromosome. You can put features within features.
Their co-ordinates are relative to the sequence, not their parent feature.
I played with relative co-ordinates & with nested/non-nested features, and
this option came out on top for both readability and ease of coding.
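
A rough sketch of why sequence-relative co-ordinates are easy to code
against. The FeatureSketch class, its fields and the 1-based inclusive
ranges are all assumptions for illustration, not the real Feature
interface. Because every nested feature carries sequence co-ordinates, a
recursive overlap query never has to accumulate parent offsets.

```java
import java.util.ArrayList;
import java.util.List;

final class FeatureSketch {
    final String type;
    final int start; // inclusive, in sequence co-ordinates
    final int end;   // inclusive, in sequence co-ordinates
    final List<FeatureSketch> children = new ArrayList<>();

    FeatureSketch(String type, int start, int end) {
        this.type = type;
        this.start = start;
        this.end = end;
    }

    /** Nests a feature; its co-ordinates are still relative to the sequence. */
    FeatureSketch addChild(String type, int start, int end) {
        FeatureSketch child = new FeatureSketch(type, start, end);
        children.add(child);
        return child;
    }

    /** Collects this feature and any nested ones overlapping [from, to]. */
    void collectOverlapping(int from, int to, List<FeatureSketch> out) {
        if (start <= to && end >= from) {
            out.add(this);
        }
        for (FeatureSketch child : children) {
            child.collectOverlapping(from, to, out); // no offset arithmetic
        }
    }

    public static void main(String[] args) {
        FeatureSketch gene = new FeatureSketch("gene", 1000, 5000);
        gene.addChild("exon", 1200, 1500);
        gene.addChild("exon", 3000, 3400);

        List<FeatureSketch> hits = new ArrayList<>();
        gene.collectOverlapping(1400, 1450, hits);
        for (FeatureSketch f : hits) {
            System.out.println(f.type + " " + f.start + ".." + f.end);
        }
    }
}
```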

>
>
> I also am worried about speed & memory - what happens when we
> try to build a ResidueList that represents a megabase-sized
> sequence?  That's a lot of Residue Objects!

We flyweight them. That's a lot of Residue references to a fixed pool of
Objects, but little memory cost.
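
Roughly, the flyweight idea looks like the sketch below. The Nucleotide
class and its forSymbol method are made-up names for the example (the real
Residue and Alphabet interfaces differ); the point is only that the list
holds many references to four shared objects.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class FlyweightSketch {
    /** Parses a string into references to the shared residue objects. */
    static List<Nucleotide> parse(String dna) {
        List<Nucleotide> residues = new ArrayList<>(dna.length());
        for (char c : dna.toCharArray()) {
            residues.add(Nucleotide.forSymbol(c));
        }
        return residues;
    }

    public static void main(String[] args) {
        List<Nucleotide> residues = parse("AACA");
        // Every 'A' in the list is the same object.
        System.out.println(residues.get(0) == residues.get(3)); // true
    }
}

/** Hypothetical flyweight residue: one shared instance per symbol. */
final class Nucleotide {
    private static final Map<Character, Nucleotide> POOL = new HashMap<>();

    static {
        for (char c : "ACGT".toCharArray()) {
            POOL.put(c, new Nucleotide(c));
        }
    }

    private final char symbol;

    // Private constructor: instances only ever come from the pool.
    private Nucleotide(char symbol) {
        this.symbol = symbol;
    }

    /** Returns the shared instance, rejecting symbols outside the alphabet. */
    static Nucleotide forSymbol(char symbol) {
        Nucleotide shared = POOL.get(symbol);
        if (shared == null) {
            throw new IllegalArgumentException("Not in alphabet: " + symbol);
        }
        return shared;
    }

    char symbol() {
        return symbol;
    }
}
```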

>
>
> Perhaps the implementation can avoid creating these million +
> Residues until someone calls the "residueAt(int it)" method -- sort of
> a lazy evaluation strategy.  (Please forgive me if I misunderstand
> the design.)

If space is an issue, we could actually store the DNA itself inside the
ResidueList as a bit-field - then each base would take two bits, and we
could fit 8 into a Java char. The residueAt(indx) method would do the
translation to/from residue objects, and you would really save space.
However, this would tie the ResidueList implementation to DNA, making it
alphabet specific. You can achieve this cleanly by interposing a layer of
factory objects for making ResidueLists, as we have done for State
objects, so if space really is an issue we can overcome it. Anyway, this is
why we always design by interface. That way, the bit-field implementation
can be swapped in to a program alongside the ArrayList implementation,
alongside the implementation that queries via JDBC, and they can all
play happily together.
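
As a sketch of the bit-field idea (nothing like this exists in the
packages yet): PackedDna is a hypothetical class, it uses 0-based
indices, and for brevity it returns plain chars where a real
implementation would hand back the shared Residue objects. Two bits per
base means eight bases per 16-bit Java char.

```java
final class PackedDna {
    // The position of a base in this string is its 2-bit code.
    private static final String BASES = "ACGT";

    private final char[] words; // 8 bases per 16-bit char
    private final int length;

    PackedDna(String dna) {
        length = dna.length();
        words = new char[(length + 7) / 8];
        for (int i = 0; i < length; i++) {
            int code = BASES.indexOf(dna.charAt(i));
            if (code < 0) {
                throw new IllegalArgumentException(
                        "Not a DNA base: " + dna.charAt(i));
            }
            // Shift the code into its 2-bit slot within the word.
            words[i / 8] |= (char) (code << ((i % 8) * 2));
        }
    }

    /** Decodes the base at a 0-based index. */
    char baseAt(int index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        int code = (words[index / 8] >> ((index % 8) * 2)) & 0b11;
        return BASES.charAt(code);
    }

    int length() {
        return length;
    }

    /** Number of chars actually used for storage. */
    int storageChars() {
        return words.length;
    }

    public static void main(String[] args) {
        PackedDna dna = new PackedDna("ACGTTGCAAC");
        System.out.println(dna.baseAt(4));      // T
        System.out.println(dna.storageChars()); // 2
    }
}
```

Client code that only talks to an interface with a residueAt-style method
would not need to know which storage is behind it.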

>
>
> May I move on to another aspect of the data models at:
>
> http://www.sanger.ac.uk/Users/td2/biojava_core_20000121/ ?
>
> I have a question/comment about the Annotatable interface and its
> getAnnotation() method - which returns an Annotation Object.
>
> Seems like it might be wise to have a getAnnotations()
> (plural) method instead.

The Annotatable interface is implemented by any bio object that wants to
have associated object-wide annotation, of potentially arbitrary type. (Yes
- we would need some beaney frame-work for displaying this.) The
Annotation object returned is basically a strict implementation of a map -
if keys don't exist, exceptions are thrown. A particular annotation
implementation may only allow certain keys or values (e.g. if it's a
wrapper around a DB record) and can throw exceptions for anything else. So,
you can
associate as many bits of information as you like with as many keys as you
can think of within an Annotation object (if the implementation allows).
Also, the objects you put in could be Annotation objects themselves, in
which case you can build a data tree. This is the strategy followed for
naive implementations of the ACeDB annotations for sequences - each node
in the ACeDB object is an Annotation, keyed by its value in the parent.
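
A minimal sketch of that behaviour, with assumed names (StrictAnnotation,
setProperty, getProperty) and an assumed exception type rather than the
actual Annotation interface: a map that throws on a missing key, and whose
values can themselves be annotations, giving a tree.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NoSuchElementException;

class AnnotationTreeSketch {
    public static void main(String[] args) {
        StrictAnnotation root = new StrictAnnotation();
        StrictAnnotation source = new StrictAnnotation();
        source.setProperty("database", "ACeDB");

        root.setProperty("description", "example clone");
        root.setProperty("source", source); // nested annotation = tree node

        StrictAnnotation child = (StrictAnnotation) root.getProperty("source");
        System.out.println(child.getProperty("database")); // ACeDB

        try {
            root.getProperty("organism");
        } catch (NoSuchElementException e) {
            System.out.println(e.getMessage());
        }
    }
}

/** Hypothetical strict map: asking for a missing key is an error. */
final class StrictAnnotation {
    private final Map<Object, Object> properties = new HashMap<>();

    void setProperty(Object key, Object value) {
        properties.put(key, value);
    }

    Object getProperty(Object key) {
        if (!properties.containsKey(key)) {
            throw new NoSuchElementException("No property for key: " + key);
        }
        return properties.get(key);
    }
}
```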

>
>
> For example, a Sequence could have many Annotations - gene
> predictions, promoter elements, etc.
>
> Or do you intend instead for each prediction, promoter element, etc be
> represented by a different Sequence Object?

You can always pull out a sub-sequence for a particular location - and
there should be methods for the sub-sequence to get features from the
parent sequence in its own coordinate system (not implemented yet - shame
on me!). This would have minimal overhead for the reasons stated above.
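
Since this isn't implemented yet, the following is only a guess at the
arithmetic involved. It assumes 1-based inclusive ranges, and the class and
method names are invented for the example: a parent feature is clipped to
the sub-sequence and then shifted into the sub-sequence's own co-ordinates.

```java
import java.util.Optional;

class SubSequenceCoordinates {
    /**
     * Maps a feature range given in parent co-ordinates onto a sub-sequence
     * covering [subStart, subEnd] of the parent. All positions are 1-based
     * and inclusive. Returns empty if the feature does not overlap.
     */
    static Optional<int[]> toSubSequence(int featureStart, int featureEnd,
                                         int subStart, int subEnd) {
        int clippedStart = Math.max(featureStart, subStart);
        int clippedEnd = Math.min(featureEnd, subEnd);
        if (clippedStart > clippedEnd) {
            return Optional.empty();
        }
        // Parent position subStart becomes position 1 in the sub-sequence.
        return Optional.of(new int[] {
            clippedStart - subStart + 1,
            clippedEnd - subStart + 1
        });
    }

    public static void main(String[] args) {
        toSubSequence(1200, 1500, 1000, 2000)
                .ifPresent(r -> System.out.println(r[0] + ".." + r[1])); // 201..501
    }
}
```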

Matthew

>
>
> -Ann
>
> --
>  Ann E. Loraine
>  loraine@fruitfly.berkeley.edu
>  http://www.fruitfly.org/~loraine
>
>  Berkeley Drosophila Genome Project
>  539 Life Science Addition
>  U.C. Berkeley
>  Berkeley, CA  94720
>  TEL: 510-643-0657
>  FAX: 510-643-9947
>
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l@biojava.org
> http://biojava.org/mailman/listinfo/biojava-l