[Biojava-dev] Re: SequenceDB way too big!!!

Tue Feb 11 09:55:06 EST 2003

Hi Russell,

The default symbol handler in the parsers now checks how much sequence 
has been read. If this exceeds some threshold (I think it's 5MB right 
now, but who's counting), it switches to a bit-packed implementation 
of SymbolList. David put some other smarts in there so 
that regions containing ambiguities use the 4-bit-per-nucleotide packing 
and regions that are not ambiguous use the 2-bit-per-nucleotide packing. 
For performance reasons (to avoid allocate/copy cycles), the symbols are 
stored in a list of fixed-length buffers and during parsing, extra 
buffers are appended as required.
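
To make the idea concrete, here is a minimal sketch of 2-bit packing on top of a list of fixed-length buffers. It is not the BioJava code: the class name PackedDnaSketch, the chunk size and the a/c/g/t encoding are all assumptions made up for illustration, it uses 0-based indices, and it leaves out the 4-bit ambiguity case entirely.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not a BioJava class. */
final class PackedDnaSketch {
    // Assumed chunk size; the real buffer length is not stated in this thread.
    private static final int SYMBOLS_PER_CHUNK = 1 << 16;
    private static final int BYTES_PER_CHUNK = SYMBOLS_PER_CHUNK / 4; // 4 symbols per byte

    private final List<byte[]> chunks = new ArrayList<>();
    private int length = 0;

    /** Maps an unambiguous base to a 2-bit code (assumed encoding). */
    static int encode(char base) {
        switch (Character.toLowerCase(base)) {
            case 'a': return 0;
            case 'c': return 1;
            case 'g': return 2;
            case 't': return 3;
            default:
                throw new IllegalArgumentException("not an unambiguous base: " + base);
        }
    }

    static char decode(int code) {
        return "acgt".charAt(code);
    }

    /** Appends one base, adding a new fixed-length buffer only when the last one is full. */
    void append(char base) {
        int code = encode(base);
        int indexInChunk = length % SYMBOLS_PER_CHUNK;
        if (indexInChunk == 0) {
            // No existing data is copied; a fresh buffer is simply added to the list.
            chunks.add(new byte[BYTES_PER_CHUNK]);
        }
        byte[] chunk = chunks.get(chunks.size() - 1);
        int shift = (indexInChunk & 3) * 2;           // bit offset inside the byte
        chunk[indexInChunk >> 2] |= (byte) (code << shift);
        length++;
    }

    /** Returns the base at a 0-based index by unpacking its 2 bits. */
    char symbolAt(int index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index " + index + ", length " + length);
        }
        byte[] chunk = chunks.get(index / SYMBOLS_PER_CHUNK);
        int indexInChunk = index % SYMBOLS_PER_CHUNK;
        int shift = (indexInChunk & 3) * 2;
        return decode((chunk[indexInChunk >> 2] >> shift) & 3);
    }

    int length() {
        return length;
    }

    int chunkCount() {
        return chunks.size();
    }

    public static void main(String[] args) {
        PackedDnaSketch seq = new PackedDnaSketch();
        for (char c : "gattaca".toCharArray()) {
            seq.append(c);
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < seq.length(); i++) {
            sb.append(seq.symbolAt(i));
        }
        System.out.println(sb);
    }
}
```

The extra shift and mask on every read is the kind of per-access cost you pay for packing, and growing by whole buffers is what avoids the allocate/copy cycle you would get from repeatedly resizing a single array.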

Our benchmarks indicate that bit-packed symbols are approximately twice 
as expensive to access. However, they take approximately 1/16 of the 
memory. Also, David made some tweaks that sped access up, so that 
packed symbols in this release are actually accessed faster than 
unpacked symbols were in 1.2 (and unpacked symbols are correspondingly 
faster).

Of course, all of this hides behind the SymbolList, Symbol and Alphabet 
interfaces, so all existing code works without any modifications.

Matthew

Russell Smithies wrote:
> Hi,
> To save me some time hunting thru the source, what was the main trick to
> reducing memory usage?
> 
> thanx
> Russell