[Biojava-l] ResidueList & Annotatable

Thomas Down td2@sanger.ac.uk
Wed, 26 Jan 2000 09:22:17 +0000

On Tue, Jan 25, 2000 at 11:51:40PM -0800, Ann Loraine wrote:
> I also am worried about speed & memory - what happens when we 
> try to build a ResidueList that represents a megabase-sized
> sequence?  That's a lot of Residue Objects!
> Perhaps the implementation can avoid creating these million +
> Residues until someone calls the "residueAt(int it)" method -- sort of
> a lazy evaluation strategy.  (Please forgive me if I misunderstand
> the design.)

This seems to be a common misunderstanding -- having Residues
represented by first-class objects does NOT imply that a new
object is needed for each base of a long sequence.  The
Sequence just stores REFERENCES to objects.  Unless you want
to do something interesting like having separate annotations for
each and every residue, there will normally just be one single
object to represent every Adenosine, everywhere within the
virtual machine.  Everything else is just a reference (which is,
for practical purposes, a pointer).  Yes, there IS a memory
overhead for this, but it's only going to be four bytes per
residue (eight on a 64-bit architecture) compared to two bytes
for a Java char.
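
As a rough illustration of the sharing described above, here is a
minimal sketch. The Residue and DnaAlphabet classes are hypothetical
stand-ins invented for this example, not the actual BioJava types;
the point is only that a million-residue sequence holds a million
references to four shared objects.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

// Hypothetical stand-ins; these are NOT the real BioJava classes.
final class Residue {
    private final String name;

    Residue(String name) {
        this.name = name;
    }

    String getName() {
        return name;
    }
}

final class DnaAlphabet {
    // One shared instance per residue type for the whole virtual machine.
    static final Residue A = new Residue("a");
    static final Residue C = new Residue("c");
    static final Residue G = new Residue("g");
    static final Residue T = new Residue("t");

    // Map a character to the shared instance; nothing is allocated here.
    static Residue forChar(char c) {
        switch (Character.toLowerCase(c)) {
            case 'a': return A;
            case 'c': return C;
            case 'g': return G;
            case 't': return T;
            default:
                throw new IllegalArgumentException("Not a DNA residue: " + c);
        }
    }

    // Build a sequence as an array of references to the shared objects.
    static Residue[] parse(String text) {
        Residue[] residues = new Residue[text.length()];
        for (int i = 0; i < text.length(); i++) {
            residues[i] = forChar(text.charAt(i));
        }
        return residues;
    }
}

class SharedResidueDemo {
    public static void main(String[] args) {
        // Build a one-megabase sequence by repeating "acgt" 250,000 times.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 250_000; i++) {
            sb.append("acgt");
        }
        Residue[] megabase = DnaAlphabet.parse(sb.toString());

        // Count distinct objects by identity rather than by equals().
        Set<Residue> distinct =
            Collections.newSetFromMap(new IdentityHashMap<Residue, Boolean>());
        distinct.addAll(Arrays.asList(megabase));

        // Prints "1000000 references, 4 distinct objects".
        System.out.println(megabase.length + " references, "
            + distinct.size() + " distinct objects");
    }
}
```

The only per-position cost in this sketch is the array slot holding
the reference.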

There should be no performance problems at all -- comparing
two residue objects will normally just be a question of comparing
two pointers (no more expensive, CPU-wise, than comparing two
chars -- maybe even faster on some modern processor architectures
which aren't actually terribly happy handling char data).
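
To make the comparison point concrete, here is a small sketch that
counts matching positions between two sequences using reference
equality (==). The Nucleotide class and the countMatches helper are
hypothetical names for this example only, and it assumes each residue
type is represented by exactly one shared instance.

```java
// Hypothetical stand-in for a residue type; not a real BioJava class.
final class Nucleotide {
    private final String name;

    Nucleotide(String name) {
        this.name = name;
    }

    String getName() {
        return name;
    }
}

class IdentityCompareDemo {
    // Assumption: exactly one instance exists per residue type.
    static final Nucleotide A = new Nucleotide("a");
    static final Nucleotide C = new Nucleotide("c");
    static final Nucleotide G = new Nucleotide("g");
    static final Nucleotide T = new Nucleotide("t");

    // Count positions where both sequences hold the same shared object.
    // Each test is a single reference comparison; equals() is never called.
    static int countMatches(Nucleotide[] x, Nucleotide[] y) {
        int n = Math.min(x.length, y.length);
        int matches = 0;
        for (int i = 0; i < n; i++) {
            if (x[i] == y[i]) {
                matches++;
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        Nucleotide[] first = {A, C, G, T, A};
        Nucleotide[] second = {A, C, T, T, G};
        // Positions 0, 1 and 3 match, so this prints 3.
        System.out.println(countMatches(first, second));
    }
}
```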

> May I move on to another aspect of the data models at:
> http://www.sanger.ac.uk/Users/td2/biojava_core_20000121/ ?
> I have a question/comment about the Annotatable interface and its 
> getAnnotation() method - which returns an Annotation Object.
> Seems like it might be wise to have a getAnnotations()
> (plural) method instead.  
> For example, a Sequence could have many Annotations - gene
> predictions, promoter elements, etc.
> Or do you intend instead for each prediction, promoter element, etc be
> represented by a different Sequence Object?  

No, I certainly wouldn't want to impose the one-type-of-annotation-
per-Sequence limit.  If you look at the bio.seq.Annotation interface,
you will see that it actually represents a set of keyed Objects
associated with a Sequence (or some other BioJava object) -- there
should be no problem associating many different pieces of data
with the Sequence.
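
For anyone who finds code easier to read than prose, a set of keyed
Objects could look roughly like the sketch below. SketchAnnotation
and its method names (setProperty, getProperty, keys) are assumptions
made for this example and should not be read as the actual
bio.seq.Annotation signatures; the values are placeholders.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of a keyed-property annotation bundle.
// The method names are assumptions, not the real interface.
final class SketchAnnotation {
    private final Map<Object, Object> properties = new LinkedHashMap<>();

    // Associate a value with a key, replacing any previous value.
    void setProperty(Object key, Object value) {
        properties.put(key, value);
    }

    // Return the value stored under the key, or null if there is none.
    Object getProperty(Object key) {
        return properties.get(key);
    }

    // Read-only view of the keys, in insertion order.
    Set<Object> keys() {
        return Collections.unmodifiableSet(properties.keySet());
    }
}

class AnnotationDemo {
    public static void main(String[] args) {
        // One annotation object carries many unrelated pieces of data.
        SketchAnnotation annotation = new SketchAnnotation();
        annotation.setProperty("source", "placeholder: how the sequence was obtained");
        annotation.setProperty("reference", "placeholder: journal citation");
        annotation.setProperty("submitter", "placeholder: name");

        for (Object key : annotation.keys()) {
            System.out.println(key + " = " + annotation.getProperty(key));
        }
    }
}
```

With this shape, a single getAnnotation() call is enough to reach
any number of values, which is why a plural getAnnotations() is not
needed.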

Note that the Annotation mechanism is only really meant for
storing data which corresponds to the whole sequence -- for
instance, information about how a sequence was obtained,
references to journals, etc.  Annotations which apply to
specific locations on the sequence (e.g. promoter elements)
would be better represented using the more structured 
bio.seq.Feature interface.
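
To show what "more structured" might mean for located data, here is
a sketch of a feature that carries a type and a location, plus a
holder that can be queried by region. SketchFeature, FeatureHolder,
and the 1-based inclusive coordinates are all assumptions for this
example; the real bio.seq.Feature interface may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical located feature; coordinates are assumed 1-based, inclusive.
final class SketchFeature {
    final String type;
    final int start;
    final int end;

    SketchFeature(String type, int start, int end) {
        if (start < 1 || end < start) {
            throw new IllegalArgumentException("Bad location: " + start + ".." + end);
        }
        this.type = type;
        this.start = start;
        this.end = end;
    }

    // True if this feature shares at least one position with [from, to].
    boolean overlaps(int from, int to) {
        return start <= to && end >= from;
    }
}

// Hypothetical container for the features attached to one sequence.
final class FeatureHolder {
    private final List<SketchFeature> features = new ArrayList<>();

    void addFeature(SketchFeature feature) {
        features.add(feature);
    }

    // Return every feature that overlaps the given region.
    List<SketchFeature> featuresOverlapping(int from, int to) {
        List<SketchFeature> hits = new ArrayList<>();
        for (SketchFeature feature : features) {
            if (feature.overlaps(from, to)) {
                hits.add(feature);
            }
        }
        return hits;
    }
}

class FeatureDemo {
    public static void main(String[] args) {
        FeatureHolder holder = new FeatureHolder();
        holder.addFeature(new SketchFeature("promoter", 100, 150));
        holder.addFeature(new SketchFeature("gene prediction", 200, 900));

        // Only the promoter overlaps positions 120..130.
        for (SketchFeature hit : holder.featuresOverlapping(120, 130)) {
            System.out.println(hit.type + " " + hit.start + ".." + hit.end);
        }
    }
}
```

A location-based query like this is awkward to express with a flat
set of keyed values, which is the reason for keeping the two
mechanisms separate.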

``Science is magic that works''  -- Kurt Vonnegut.