[Biojava-l] "packed" SymbolLists?
David Huen
smh1008@cus.cam.ac.uk
Tue, 29 Oct 2002 19:23:13 +0100
On Thursday 24 Oct 2002 9:55 am, Thomas Down wrote:
>
> There's a fairly clear solution to this -- refactor PackedSymbolList
> to use a memory allocation scheme similar to ChunkedSymbolList
> (currently the default implementation for large sequences), so
> that new chunks can be allocated on the fly as the sequence grows.
> Anyone feel like taking this on?
>
Done. It might do with some renaming of classes (names I choose often
suck). Internal code shamelessly ripped off from existing classes and
munged/optimised.
There is now a ChunkedSymbolListFactory (i suspect it should be a builder
but that name's taken) which takes a SymbolListFactory object.
ChunkedSymbolListFactory has an addSymbols method just like the
SeqIOListener one so you can wrap it in a class that implements that
interface and take Symbols straight from your input source and chuck it
into your new ChunkedSymbolList. It also takes as an argument in its
constructor a SymbolListFactory object that determines the object used to
store the chunks. I have created PackedSymbolListFactory and
SimpleSymbolListFactory classes that can be passed to it which create
chunks implemented by PackedSymbolLists and SimpleSymbolLists respectively.
Some optimisations have been done to improve PackedSymbolList performance.
When comparing chunked implementations backed by PackedSymbolLists and
SimpleSymbolLists, the optimisations result in linear read speeds rising
from 2.6->4.8 Mbase/s for PackedSymbolLists and 4.9->8.6 Mbase/s for
SimpleSymbolLists. This in on a P3/800 with 500MB RAM. So there is still
a penalty for using packed chunked lists (and plain symbol lists are even
faster). But if you should be operating on low RAM, the packed
implementation will definitely win massively when you go into swap.
Packing reduces object size 8-fold (32-bit to 4-bit) with ambiguity codes
to 16-fold without. You do realise that these numbers make it possible for
us to run DAS reference servers for the human genome with all the symbols
in core. That'll just fly compared with any DB-backed implementation ;-).
Only DNA alphabet is currently supported (without gap symbols).
It probably can be improved further but that's all I have time for. At some
stage I will implement intelligent packing (ie. pack unambiguous sequence
with unambiguous coding) which will get us more space savings under normal
circumstances. The scope for protein packing is less, around 3 fold so I'm
not sure if it's worthwhile since proteins are relatively short.
Regards,
David Huen