[Biojava-l] New Wiki page

Matthew Pocock mrp@sanger.ac.uk
Thu, 08 Feb 2001 11:34:38 +0000


Hi Paul,

A bit-compressed symbol-list implementation would be a good thing for 
whole-chromosome analysis. There is an interface called AlphabetIndex in 
the symbol package that maps an alphabet to/from integers. It should be 
fairly easy to write a SymbolList implementation that uses one of these, 
some bit-shifts and a byte-array to work out a relatively efficient 
storage mechanism for alphabets with <= 8 symbols (e.g. DNA & RNA).
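
As a rough illustration of the bit-packing idea, here is a minimal sketch in plain Java. It is not BioJava code and does not implement SymbolList: the class name PackedSymbols, its methods, and the use of a String of chars as the alphabet (standing in for what an AlphabetIndex would provide) are all assumptions made for the example. Positions are 0-based, and the width is rounded up to 1, 2, 4 or 8 bits so that no symbol straddles a byte boundary.

```java
// Hypothetical sketch, not BioJava code. The alphabet string plays the role
// that an AlphabetIndex would: it maps each symbol to a small integer.
final class PackedSymbols {
    private final char[] alphabet;  // index -> symbol
    private final int bits;         // bits per symbol: 1, 2, 4 or 8
    private final int perByte;      // number of symbols stored in each byte
    private final int mask;         // the low `bits` bits set
    private final byte[] data;      // packed storage
    private final int length;       // number of symbols

    PackedSymbols(String alphabetChars, String sequence) {
        alphabet = alphabetChars.toCharArray();
        if (alphabet.length == 0 || alphabet.length > 256) {
            throw new IllegalArgumentException("alphabet must have 1..256 symbols");
        }
        // Round the width up to a power of two so no symbol straddles a byte.
        int b = 1;
        while ((1 << b) < alphabet.length) b *= 2;
        bits = b;
        perByte = 8 / bits;
        mask = (1 << bits) - 1;
        length = sequence.length();
        data = new byte[(length + perByte - 1) / perByte];
        for (int i = 0; i < length; i++) {
            int shift = (i % perByte) * bits;
            data[i / perByte] |= (byte) (indexOf(sequence.charAt(i)) << shift);
        }
    }

    // Linear search is fine for tiny alphabets such as DNA.
    private int indexOf(char symbol) {
        for (int i = 0; i < alphabet.length; i++) {
            if (alphabet[i] == symbol) return i;
        }
        throw new IllegalArgumentException("symbol not in alphabet: " + symbol);
    }

    // Returns the symbol at a 0-based position.
    char symbolAt(int i) {
        if (i < 0 || i >= length) throw new IndexOutOfBoundsException("index " + i);
        int shift = (i % perByte) * bits;
        return alphabet[((data[i / perByte] & 0xFF) >>> shift) & mask];
    }

    int length() { return length; }

    int byteCount() { return data.length; }

    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) sb.append(symbolAt(i));
        return sb.toString();
    }
}
```

With the alphabet "ACGT" each base takes 2 bits, so four bases share a byte. An 8-symbol alphabet lands on 4 bits in this sketch; packing at exactly 3 bits would be tighter but needs symbols that span two bytes.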

We would need to benchmark this - pointers are cheap, but the page 
swapping is expensive. Bit-arithmetic potentially costs more CPU, but you 
can fit more sequence into one chunk of memory.

On the other hand, we get away with running analysis programs over 
chromosome 1 (and 22 trivially) by loading chunks of sequence on demand 
(behind a SymbolList implementation) - Thomas is the one to bug about 
this. Chunking byte-compressed sequences may be the optimal solution 
though...
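
To make the chunk-on-demand idea concrete, here is a small sketch in plain Java. It is only an illustration, not the implementation Thomas wrote and not a BioJava class: ChunkedSequence, its loader callback and the `loads` counter are hypothetical names, and the cache is a simple least-recently-used map. Positions are 0-based.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.IntFunction;

// Hypothetical sketch: fetch fixed-size chunks lazily and keep only the
// most recently used ones in memory.
final class ChunkedSequence {
    private final long length;                 // total number of symbols
    private final int chunkSize;               // symbols per chunk
    private final IntFunction<char[]> loader;  // chunk number -> chunk contents
    private final Map<Integer, char[]> cache;  // least-recently-used cache
    int loads = 0;                             // loader calls, for illustration

    ChunkedSequence(long length, int chunkSize, final int maxChunks,
                    IntFunction<char[]> loader) {
        this.length = length;
        this.chunkSize = chunkSize;
        this.loader = loader;
        // accessOrder = true makes iteration order least-recently-used first,
        // so removeEldestEntry evicts the chunk that was touched longest ago.
        this.cache = new LinkedHashMap<Integer, char[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, char[]> eldest) {
                return size() > maxChunks;
            }
        };
    }

    // Returns the symbol at a 0-based position, loading its chunk if needed.
    char symbolAt(long i) {
        if (i < 0 || i >= length) throw new IndexOutOfBoundsException("index " + i);
        int chunk = (int) (i / chunkSize);
        char[] symbols = cache.get(chunk);
        if (symbols == null) {
            symbols = loader.apply(chunk);
            loads++;
            cache.put(chunk, symbols);
        }
        return symbols[(int) (i % chunkSize)];
    }
}
```

The loader could read from a file or a database. Storing each chunk in a packed form instead of a char[] would be one way to combine the two approaches.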

Matthew

Paul Edlefsen wrote:

> Speaking of who's doing what, I was considering writing an implementation of
> SymbolList that takes a nibble (or maybe a byte) per DNA base instead of a
> word.  I've got this code in C++ and thought I'd port it over, though I
> haven't yet begun.
> 
> Is anybody else working along similar lines?  I need to read in multimegabase
> sequences and just 35Megabase Human chr.22 is too much for the current
> implementation, even increasing the heap to 128Megs.  (This makes sense:  35 M
> bases * 4 bytes/base > 128 M bytes).
> 
> Our goal is to make some open tools for whole-genome analysis and
> cross-species comparison.  35 Megabases is just the tip of the iceberg: to
> defend biojava to my peers I need to demonstrate that it can handle big
> sequences.
> 
> :Paul