[Biojava-l] ResidueList & Annotatable
Ann Loraine
loraine@fruitfly.bdgp.berkeley.edu
Wed, 26 Jan 2000 09:09:11 -0800 (PST)
> >
> > I also am worried about speed & memory - what happens when we
> > try to build a ResidueList that represents a megabase-sized
> > sequence? That's a lot of Residue Objects!
> >
> > Perhaps the implementation can avoid creating these million +
> > Residues until someone calls the "residueAt(int it)" method -- sort of
> > a lazy evaluation strategy. (Please forgive me if I misunderstand
> > the design.)
>
> This seems to be a common misunderstanding -- having Residues
> represented by first class objects does NOT imply that a new
> object is needed for each base of a long sequence. The
> Sequence just stores REFERENCES to objects. Unless you want
> to do something interesting like having separate annotations for
> each and every residue, there will normally just be one single
> object to represent every Adenosine, everywhere within the
> virtual machine. Everything else is just a reference (which is,
> for practical purposes, a pointer). Yes, there IS a memory
> overhead for this, but it's only going to be four bytes
> (eight on a 64-bit architecture) compared to two bytes for
> a Java char.
Ah...thanks for the explanation! I see the javadocs even state:
"there can be one instance of each residue that is referenced multiple
times"
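A minimal sketch of the sharing Thomas describes, assuming a hypothetical Residue class with one static instance per base (these names are illustrative and are not taken from the BioJava draft). A sequence is an array of references, so its length does not change the number of Residue objects, and comparing two residues is a reference comparison:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

public class SharedResidues {
    /** Hypothetical residue type; the private constructor prevents extra instances. */
    static final class Residue {
        final char symbol;
        private Residue(char symbol) { this.symbol = symbol; }
    }

    // One shared object per DNA base for the whole virtual machine.
    static final Residue A = new Residue('a');
    static final Residue C = new Residue('c');
    static final Residue G = new Residue('g');
    static final Residue T = new Residue('t');

    /** Maps a character to the shared instance; never allocates. */
    static Residue forSymbol(char c) {
        switch (Character.toLowerCase(c)) {
            case 'a': return A;
            case 'c': return C;
            case 'g': return G;
            case 't': return T;
            default: throw new IllegalArgumentException("Unknown residue: " + c);
        }
    }

    /** Builds a sequence as an array of references to the shared residues. */
    static Residue[] parse(String dna) {
        Residue[] out = new Residue[dna.length()];
        for (int i = 0; i < out.length; i++) {
            out[i] = forSymbol(dna.charAt(i));
        }
        return out;
    }

    /** Counts distinct objects by identity, not by equals(). */
    static int distinctInstances(Residue[] residues) {
        Set<Residue> seen =
            Collections.newSetFromMap(new IdentityHashMap<Residue, Boolean>());
        seen.addAll(Arrays.asList(residues));
        return seen.size();
    }

    public static void main(String[] args) {
        Residue[] seq = parse("gattacagattaca");
        // prints "14 references, 4 distinct objects"
        System.out.println(seq.length + " references, "
            + distinctInstances(seq) + " distinct objects");
        // Both positions hold 'a', so this is a plain pointer comparison: prints "true"
        System.out.println(seq[1] == seq[4]);
    }
}
```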
One last concern I have about getting away from Strings:
I'd like to use Java regular expression packages to examine sequences,
and such packages* most likely would deal with Strings. If there's a way
to get a String out of a ResidueList then I will be satisfied.
* (e.g., savarese.org, ORO software)
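As a sketch of that String round trip: the ResidueList interface below is a hypothetical stand-in with only two methods (0-based indexing is an assumption), and java.util.regex from the current Java standard library is used in place of the ORO package. The list is flattened to a String once and the pattern is then run over it:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ResidueRegex {
    /** Hypothetical minimal view of a residue list; not the BioJava interface. */
    interface ResidueList {
        int length();
        char symbolAt(int index); // 0-based in this sketch
    }

    /** Simple array-backed list, only so that the example is runnable. */
    static ResidueList fromChars(final char[] symbols) {
        return new ResidueList() {
            public int length() { return symbols.length; }
            public char symbolAt(int index) { return symbols[index]; }
        };
    }

    /** Flattens the list into a String, one character per residue. */
    static String toSequenceString(ResidueList list) {
        StringBuilder sb = new StringBuilder(list.length());
        for (int i = 0; i < list.length(); i++) {
            sb.append(list.symbolAt(i));
        }
        return sb.toString();
    }

    /** Returns the start offsets of all non-overlapping matches of the regex. */
    static List<Integer> findAll(String regex, ResidueList list) {
        Matcher m = Pattern.compile(regex).matcher(toSequenceString(list));
        List<Integer> starts = new ArrayList<Integer>();
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        ResidueList list = fromChars("cctataaaggtatataa".toCharArray());
        // prints "[2, 10]"
        System.out.println(findAll("tata[at]a", list));
    }
}
```

The conversion copies the whole sequence, so for a megabase-sized sequence it would probably be worth doing once and reusing the String rather than converting per search.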
>
> There should be no performance problems at all -- comparing
> two residue objects will normally just be a question of comparing
> two pointers (no more expensive, CPU-wise, than comparing two
> chars -- maybe even faster on some modern processor architectures
> which aren't actually terribly happy handling char data).
>
> > May I move on to another aspect of the data models at:
> >
> > http://www.sanger.ac.uk/Users/td2/biojava_core_20000121/ ?
> >
> > I have a question/comment about the Annotatable interface and its
> > getAnnotation() method - which returns an Annotation Object.
> >
> > Seems like it might be wise to have a getAnnotations()
> > (plural) method instead.
> >
> > For example, a Sequence could have many Annotations - gene
> > predictions, promoter elements, etc.
> >
> > Or do you intend instead for each prediction, promoter element, etc. to be
> > represented by a different Sequence Object?
>
> No, I certainly wouldn't want to impose the one-type-of-annotation-
> per-Sequence limit. If you look at the bio.seq.Annotation interface,
> you will see that it actually represents a set of keyed Objects
> associated with a Sequence (or some other BioJava object) -- there
> should be no problem associating many different pieces of data
> with the Sequence.
>
> Note that the Annotation mechanism is only really meant for
> storing data which corresponds to the whole sequence -- for
> instance, information about how a sequence was obtained,
> references to journals, etc. Annotations which apply to
> specific locations on the sequence (e.g. promoter elements)
> would be better represented using the more structured
> bio.seq.Feature interface.
>
>
> Thomas.
> --
> ``Science is magic that works'' -- Kurt Vonnegut.
>
Okay, I think I'm understanding this now.
So a Sequence would have a single Annotation Object, which itself has
numerous Features, all retrievable if I know what "key" to use?
So if I wanted all the exons in a sequence, I could do something like:
Object exon_list = annotation.get("exons") ?
And exon_list would be a Set or some other data structure which
contained Features representing exons?
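One possible reading of Thomas's description, sketched with hypothetical classes (none of these names or signatures are taken from the bio.seq interfaces): the Annotation holds keyed data about the whole sequence, while location-specific items such as exons are Features held by the Sequence itself and selected by type, rather than fetched from the Annotation by key.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class AnnotationVsFeature {
    /** Hypothetical: keyed Objects that describe the sequence as a whole. */
    static final class Annotation {
        private final Map<Object, Object> properties = new HashMap<Object, Object>();
        void setProperty(Object key, Object value) { properties.put(key, value); }
        Object getProperty(Object key) { return properties.get(key); }
    }

    /** Hypothetical: something located at a specific span of the sequence. */
    static final class Feature {
        final String type;
        final int start;
        final int end;
        Feature(String type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    /** Hypothetical sequence that owns one Annotation and many Features. */
    static final class Sequence {
        final Annotation annotation = new Annotation();
        private final List<Feature> features = new ArrayList<Feature>();

        void addFeature(Feature f) { features.add(f); }

        /** Returns the features whose type matches, in insertion order. */
        List<Feature> featuresOfType(String type) {
            List<Feature> out = new ArrayList<Feature>();
            for (Feature f : features) {
                if (f.type.equals(type)) {
                    out.add(f);
                }
            }
            return out;
        }
    }

    /** Builds a small made-up example sequence. */
    static Sequence example() {
        Sequence seq = new Sequence();
        seq.annotation.setProperty("source", "made-up example record");
        seq.addFeature(new Feature("promoter", 1, 9));
        seq.addFeature(new Feature("exon", 10, 50));
        seq.addFeature(new Feature("exon", 80, 120));
        return seq;
    }

    public static void main(String[] args) {
        Sequence seq = example();
        // Whole-sequence data comes from the Annotation by key...
        System.out.println(seq.annotation.getProperty("source"));
        // ...while exons come from the features: prints "2"
        System.out.println(seq.featuresOfType("exon").size());
    }
}
```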
-Ann