[Biojava-l] Re: the current discussion

Thomas Down td2@sanger.ac.uk
Wed, 26 Jan 2000 09:57:32 +0000


On Wed, Jan 26, 2000 at 03:55:45PM +1300, Mark Schreiber wrote:
>
> I agree with the utility of it but practically if you want to number
> crunch a bacterial genome that means 4 million objects to hold in memory
> unless you are a tricky programmer (which I am not). The memory
> requirements must surely be substantially more than for a single object
> containing a 4 million member String.

As previous replies have pointed out, we're only talking about object
references, not whole object instances.  The memory overhead will only
be 2-4 times that of using Strings.  In practice, we've loaded plenty
of large pieces of DNA (including human chromosome 22, which I believe
is still the longest piece available) and processed them without
any trouble.

If you're really worried about memory, there's no reason why you
can't have an object which stores data as a String but which
implements the Sequence interface, and returns residue objects
on request.  In fact, if you're really worried you could use
a byte array instead -- using half as much RAM as a Java string,
and the same amount as a C string.
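
As a rough sketch of the byte-array idea (the saving comes from Java
chars being 16 bits wide against 8 bits for a byte): the interface
and class names below are invented for illustration and are not the
BioJava API. Only the residueAt name comes from this thread, and the
0-based indexing is an assumption made for brevity. The sequence keeps
one byte per residue and hands out shared residue objects on request,
so no per-position object is ever stored.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a residue type; not the BioJava interface.
interface Residue {
    char symbol();
}

// One shared instance per nucleotide, so lookups allocate nothing.
enum Nucleotide implements Residue {
    A('a'), C('c'), G('g'), T('t');

    private final char symbol;

    Nucleotide(char symbol) { this.symbol = symbol; }

    public char symbol() { return symbol; }

    // Map a stored byte back to its shared residue object.
    static Nucleotide forByte(byte b) {
        switch (Character.toLowerCase((char) b)) {
            case 'a': return A;
            case 'c': return C;
            case 'g': return G;
            case 't': return T;
            default:
                throw new IllegalArgumentException(
                        "Not a DNA symbol: " + (char) b);
        }
    }
}

// Hypothetical stand-in for a sequence interface (0-based here).
interface Sequence {
    int length();
    Residue residueAt(int index);
}

// Stores one byte per residue, creates no per-position objects.
final class ByteArraySequence implements Sequence {
    private final byte[] data;

    ByteArraySequence(String dna) {
        this.data = dna.getBytes(StandardCharsets.US_ASCII);
        // Validate eagerly so bad symbols fail at construction time.
        for (byte b : data) {
            Nucleotide.forByte(b);
        }
    }

    public int length() { return data.length; }

    public Residue residueAt(int index) {
        return Nucleotide.forByte(data[index]);
    }
}

class ByteArraySequenceDemo {
    public static void main(String[] args) {
        Sequence seq = new ByteArraySequence("gattaca");
        System.out.println(seq.length() + " residues, first is "
                + seq.residueAt(0).symbol());
    }
}
```

Code written against the Sequence interface cannot tell this apart
from an implementation that holds an array of object references, and
an invalid symbol is still rejected when the sequence is built.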

> > Our sequences have a simple method to retrieve the sequence as a string of
> > chars where each char represents a single residue. Also, as everything is
> > implemented on top of interfaces, you could write an implementation that really
> > did use a string of chars to represent the sequence, as long as you wired in
> > appropriate residueAt and iterator methods.
> 
> Having the ability to change between the two models is definitely the way
> to go. (Unless you think there is no use for String based analysis).

I'd hope that most analysis methods written using BioJava
would take the String approach.  But it would still be nice
to have bridges which allow us to interact with OMG BioObject
sequences, assisting multi-language development.

> > Having residue objects catches loads of errors that would go unnoticed
> > otherwise. Also, for HMMs, each state within the model is a State object that
> > extends Residue, so you can naturally manipulate sequences of states. This is
> > really useful - a multiple-sequence-alignment can contain sequences and states.
> > But - as states are not defined by chars, we can make virtual states with no
> > sensible way of naming them.
> 
> I like this idea but I have one reservation. By deriving a State object
> from a Residue object you lose some of the flexibility of an HMM. This is
> because a State can emit not just a residue but also a string of
> residues (as in GeneMark.hmm) or a vector, or even another HMM (or
> anything else that you may want to emit).

Having States in an HMM implement Residue does not put any restrictions
on what the States can emit -- different types of states could emit
no residues, one, or many.  But having the States implement Residue
means that you can build a ResidueList (sequence) of States used
in the model.  This ResidueList effectively becomes a `labelling'
of the biological sequence you are processing.
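
To make the labelling idea concrete, here is a minimal sketch. The
classes are hypothetical, pared-down stand-ins rather than the real
BioJava types, and the emissionLength field and the rule of repeating
a state once per residue it emits are assumptions made for the
example. The point is only that a State, being a Residue, can sit in
an ordinary list of residues regardless of how much it emits.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, minimal residue type; states need a name, not a char.
interface Residue {
    String getName();
}

// A state is itself a Residue, so it can be stored in a residue list.
final class State implements Residue {
    private final String name;
    private final int emissionLength; // residues emitted per visit

    State(String name, int emissionLength) {
        if (emissionLength < 0) {
            throw new IllegalArgumentException("negative emission length");
        }
        this.name = name;
        this.emissionLength = emissionLength;
    }

    public String getName() { return name; }

    int emissionLength() { return emissionLength; }
}

final class Labelling {
    // Expand a state path into one label per emitted residue, so that
    // position i of the result labels position i of the sequence.
    // States that emit nothing contribute no labels.
    static List<Residue> label(List<State> path) {
        List<Residue> labels = new ArrayList<>();
        for (State state : path) {
            for (int i = 0; i < state.emissionLength(); i++) {
                labels.add(state);
            }
        }
        return labels;
    }
}

class LabellingDemo {
    public static void main(String[] args) {
        List<State> path = new ArrayList<>();
        path.add(new State("begin", 0));  // silent state
        path.add(new State("codon", 3));  // emits three residues
        path.add(new State("intron", 1)); // emits one residue
        for (Residue label : Labelling.label(path)) {
            System.out.println(label.getName());
        }
    }
}
```

The state path itself is already a list of residues; the expansion
step is just one way of lining it up against the emitted sequence.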

Thomas.
-- 
``Science is magic that works''  -- Kurt Vonnegut.