[Biojava-dev] Re: SequenceDB way too big!!!
Matthew Pocock
matthew_pocock at yahoo.co.uk
Tue Feb 11 09:55:06 EST 2003
Hi Russell,
The default handler for symbols in the parsers now checks how
much sequence has been read. If this exceeds some threshold (I think
it's 5MB right now, but who's counting), it uses a bit-packed
implementation of SymbolList. David put some other smarts in there so
that regions containing ambiguities use the 4-bit-per-nucleotide packing
and regions that are not ambiguous use the 2-bit-per-nucleotide packing.
For performance reasons (to avoid allocate/copy cycles), the symbols are
stored in a list of fixed-length buffers and during parsing, extra
buffers are appended as required.
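
To make the idea concrete, here is a minimal sketch of 2-bit-per-nucleotide packing over a list of fixed-length buffers. It is only an illustration of the technique described above, not the code that is in BioJava: the class name PackedDna, its methods, the a/c/g/t code assignment and the chunk size are all made up for this example, and it leaves out the 4-bit ambiguity packing altogether.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch only; this is not BioJava's packed SymbolList. */
final class PackedDna {
    // Assumed code assignment: the index in this string is the 2-bit code.
    private static final String CODES = "acgt";

    private final int symbolsPerChunk;
    private final List<byte[]> chunks = new ArrayList<>();
    private int length = 0;

    PackedDna(int symbolsPerChunk) {
        if (symbolsPerChunk <= 0 || symbolsPerChunk % 4 != 0) {
            throw new IllegalArgumentException("chunk size must be a positive multiple of 4");
        }
        this.symbolsPerChunk = symbolsPerChunk;
    }

    /** Appends one unambiguous base, adding a new buffer only when the last one is full. */
    void append(char base) {
        int code = CODES.indexOf(Character.toLowerCase(base));
        if (code < 0) {
            throw new IllegalArgumentException("not an unambiguous base: " + base);
        }
        int offset = length % symbolsPerChunk;
        if (offset == 0) {
            // Existing buffers are never reallocated or copied.
            chunks.add(new byte[symbolsPerChunk / 4]);
        }
        byte[] chunk = chunks.get(chunks.size() - 1);
        // Four bases share one byte; each takes two bits.
        chunk[offset / 4] |= (byte) (code << ((offset % 4) * 2));
        length++;
    }

    /** Returns the base at a 0-based index by unpacking its two bits. */
    char charAt(int index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        byte[] chunk = chunks.get(index / symbolsPerChunk);
        int offset = index % symbolsPerChunk;
        int code = (chunk[offset / 4] >> ((offset % 4) * 2)) & 3;
        return CODES.charAt(code);
    }

    int length() {
        return length;
    }

    int chunkCount() {
        return chunks.size();
    }
}
```

Appending ten bases to a PackedDna built with a chunk size of 8 allocates two 2-byte buffers, and charAt() gives back the same bases in order. A parser would presumably use a much larger chunk size than that.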
Our benchmarks indicate that bit-packed symbols are approximately twice
as expensive to access. However, they take approximately 1/16 of the
memory. Also, David made some tweaks that sped up access, so that
packed symbols in this release are actually accessed faster than
unpacked symbols were in 1.2 (and unpacked symbols are correspondingly
faster).
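
One way to see where a figure like 1/16 could come from is a back-of-envelope calculation. This assumes the unpacked list holds one 4-byte object reference per symbol, as on a 32-bit JVM; that assumption, and the helper class PackingMath, are mine for illustration and are not taken from the benchmarks.

```java
/** Hypothetical helper for rough memory estimates; not part of BioJava. */
final class PackingMath {
    /** Bytes used when each symbol is held as one object reference. */
    static long unpackedBytes(long symbols, int bytesPerReference) {
        return symbols * bytesPerReference;
    }

    /** Bytes used when each symbol takes a fixed number of bits, rounded up. */
    static long packedBytes(long symbols, int bitsPerSymbol) {
        return (symbols * bitsPerSymbol + 7) / 8;
    }
}
```

Under that assumption, 100 million bases take 400,000,000 bytes unpacked and 25,000,000 bytes at 2 bits per base, which is the 1/16 ratio. Regions stored at 4 bits per base would take 50,000,000 bytes, or 1/8.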
Of course, all of this hides behind the SymbolList, Symbol and Alphabet
interfaces, so all existing code works without any modifications.
Matthew
Russell Smithies wrote:
> Hi,
> To save me some time hunting thru the source, what was the main trick to
> reducing memory usage?
>
> thanx
> Russell