[Biojava-l] ResidueList & Annotatable

Ann Loraine loraine@fruitfly.bdgp.berkeley.edu
Wed, 26 Jan 2000 09:09:11 -0800 (PST)


> > 
> > I also am worried about speed & memory - what happens when we 
> > try to build a ResidueList that represents a megabase-sized
> > sequence?  That's a lot of Residue Objects!
> > 
> > Perhaps the implementation can avoid creating these million +
> > Residues until someone calls the "residueAt(int it)" method -- sort of
> > a lazy evaluation strategy.  (Please forgive me if I misunderstand
> > the design.)
> 
> This seems to be a common misunderstanding -- having Residues
> represented by first class objects does NOT imply that a new
> object is needed for each base of a long sequence.  The
> Sequence just stores REFERENCES to objects.  Unless you want
> to do something interesting like having separate annotations for
> each and every residue, there will normally just be one single
> object to represent every Adenosine, everywhere within the
> virtual machine.  Everything else is just a reference (which is,
> for practical purposes, a pointer).  Yes, there IS a memory
> overhead for this, but it's only going to be four bytes
> (eight on a 64 bit architecture) compared to two bytes for
> a Java char.

Ah...thanks for the explanation!  I see the javadocs even state:

"there can be one instance of each residue that is referenced multiple
times"
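
To check my understanding, here is a minimal sketch of that
shared-instance idea. The class and method names (Residue, forChar,
parse) are made up for illustration and are not taken from the BioJava
sources; the only point is that the list holds references to four
shared objects, so comparing residues is a reference comparison.

```java
import java.util.ArrayList;
import java.util.List;

public class SharedResidueSketch {

    /** Hypothetical residue type: exactly one immutable instance per base. */
    static final class Residue {
        static final Residue A = new Residue('a');
        static final Residue C = new Residue('c');
        static final Residue G = new Residue('g');
        static final Residue T = new Residue('t');

        private final char symbol;

        private Residue(char symbol) {
            this.symbol = symbol;
        }

        char getSymbol() {
            return symbol;
        }

        /** Returns the shared instance for a character; never allocates. */
        static Residue forChar(char c) {
            switch (Character.toLowerCase(c)) {
                case 'a': return A;
                case 'c': return C;
                case 'g': return G;
                case 't': return T;
                default:
                    throw new IllegalArgumentException("Unknown residue: " + c);
            }
        }
    }

    /** Builds a list of references to the shared residues. */
    static List<Residue> parse(String dna) {
        List<Residue> residues = new ArrayList<>(dna.length());
        for (int i = 0; i < dna.length(); i++) {
            residues.add(Residue.forChar(dna.charAt(i)));
        }
        return residues;
    }

    public static void main(String[] args) {
        List<Residue> seq = parse("gattaca");
        // Both positions hold a reference to the same Residue.A object,
        // so a pointer comparison with == is enough.
        System.out.println(seq.get(1) == seq.get(4));
        System.out.println(seq.get(0) == Residue.G);
    }
}
```

However long the sequence gets, only the four Residue objects exist;
the per-position cost is one reference.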

One last concern I have about getting away from Strings:  

I'd like to use Java regular expression packages to examine sequences,
and such packages* most likely would deal with Strings.  If there's a way
to get a String out of a ResidueList then I will be satisfied.

* (e.g., savarese.org, ORO software)
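
Roughly what I have in mind, as a sketch only: the Residue and
ResidueList types below are hypothetical stand-ins (including the
0-based residueAt indexing and the asString helper), not the proposed
BioJava interfaces, and the standard java.util.regex package is used
here in place of the ORO classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ResidueRegexSketch {

    /** Hypothetical residue type with a one-character symbol. */
    enum Residue {
        A('a'), C('c'), G('g'), T('t');

        final char symbol;

        Residue(char symbol) {
            this.symbol = symbol;
        }
    }

    /** Hypothetical list type; indexing is assumed to be 0-based. */
    static final class ResidueList {
        private final Residue[] residues;

        ResidueList(Residue... residues) {
            this.residues = residues.clone();
        }

        int length() {
            return residues.length;
        }

        Residue residueAt(int index) {
            return residues[index];
        }
    }

    /** Flattens the list into a String, one character per residue. */
    static String asString(ResidueList list) {
        StringBuilder sb = new StringBuilder(list.length());
        for (int i = 0; i < list.length(); i++) {
            sb.append(list.residueAt(i).symbol);
        }
        return sb.toString();
    }

    /** Returns the 0-based start offset of every regex match. */
    static List<Integer> matchStarts(ResidueList list, String regex) {
        Matcher matcher = Pattern.compile(regex).matcher(asString(list));
        List<Integer> starts = new ArrayList<>();
        while (matcher.find()) {
            starts.add(matcher.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        ResidueList seq = new ResidueList(
                Residue.G, Residue.A, Residue.T, Residue.T,
                Residue.A, Residue.C, Residue.A);
        System.out.println(asString(seq));               // prints "gattaca"
        System.out.println(matchStarts(seq, "a[ct]a"));  // prints "[4]"
    }
}
```

Because the match offsets index into the String, they map straight
back to positions in the ResidueList.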

> 
> There should be no performance problems at all -- comparing
> two residue objects will normally just be a question of comparing
> two pointers (no more expensive, CPU-wise, than comparing two
> chars -- maybe even faster on some modern processor architectures
> which aren't actually terribly happy handling char data).
> 
> > May I move on to another aspect of the data models at:
> > 
> > http://www.sanger.ac.uk/Users/td2/biojava_core_20000121/ ?
> > 
> > I have a question/comment about the Annotatable interface and its 
> > getAnnotation() method - which returns an Annotation Object.
> > 
> > Seems like it might be wise to have a getAnnotations()
> > (plural) method instead.  
> > 
> > For example, a Sequence could have many Annotations - gene
> > predictions, promoter elements, etc.
> > 
> > Or do you intend instead for each prediction, promoter element, etc. to be
> > represented by a different Sequence Object?  
> 
> No, I certainly wouldn't want to impose the one-type-of-annotation-
> per-Sequence limit.  If you look at the bio.seq.Annotation interface,
> you will see that it actually represents a set of keyed Objects
> associated with a Sequence (or some other BioJava object) -- there
> should be no problem associating many different pieces of data
> with the Sequence.
> 
> Note that the Annotation mechanism is only really meant for
> storing data which corresponds to the whole sequence -- for
> instance, information about how a sequence was obtained,
> references to journals, etc.  Annotations which apply to
> specific locations on the sequence (e.g. promoter elements)
> would be better represented using the more structured 
> bio.seq.Feature interface.
> 
> 
> Thomas.
> -- 
> ``Science is magic that works''  -- Kurt Vonnegut.
> 
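
The distinction drawn above can be sketched as follows. This is an
illustration under assumptions, not the bio.seq API: the Annotation,
Feature and Sequence classes here, and names such as featuresOfType,
are hypothetical. Whole-sequence data sits in a keyed bag, while
anything tied to a location is a separate, structured Feature.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class AnnotationVsFeatureSketch {

    /** Hypothetical keyed bag of data describing the sequence as a whole. */
    static final class Annotation {
        private final Map<String, Object> properties = new HashMap<>();

        void set(String key, Object value) {
            properties.put(key, value);
        }

        Object get(String key) {
            return properties.get(key);
        }
    }

    /** Hypothetical located feature, e.g. an exon or a promoter element. */
    static final class Feature {
        final String type;
        final int start;
        final int end;

        Feature(String type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    /** Hypothetical sequence holding one Annotation and many Features. */
    static final class Sequence {
        final Annotation annotation = new Annotation();
        final List<Feature> features = new ArrayList<>();

        /** Collects the features whose type equals the given name. */
        List<Feature> featuresOfType(String type) {
            List<Feature> matches = new ArrayList<>();
            for (Feature feature : features) {
                if (feature.type.equals(type)) {
                    matches.add(feature);
                }
            }
            return matches;
        }
    }

    public static void main(String[] args) {
        Sequence seq = new Sequence();

        // Whole-sequence data goes into the keyed Annotation.
        seq.annotation.set("source", "cosmid library");

        // Location-specific data goes into Features.
        seq.features.add(new Feature("exon", 120, 310));
        seq.features.add(new Feature("promoter", 40, 95));
        seq.features.add(new Feature("exon", 560, 742));

        System.out.println(seq.annotation.get("source"));
        System.out.println(seq.featuresOfType("exon").size());
    }
}
```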

Okay, I think I'm understanding this now.  

So a Sequence would have a single Annotation Object, which itself has
numerous Features, all retrievable if I know what "key" to use?

So if I wanted all the exons in a sequence, I could do something like:

Object exon_list = annotation.get("exons") ?

And exon_list would be a Set or some other data structure which
contained Features representing exons?

-Ann