[Biojava-l] "packed" SymbolLists?

David Huen smh1008@cus.cam.ac.uk
Tue, 29 Oct 2002 19:23:13 +0100


On Thursday 24 Oct 2002 9:55 am, Thomas Down wrote:
>
> There's a fairly clear solution to this -- refactor PackedSymbolList
> to use a memory allocation scheme similar to ChunkedSymbolList
> (currently the default implementation for large sequences), so
> that new chunks can be allocated on the fly as the sequence grows.
> Anyone feel like taking this on?
>
Done.  It might do with some renaming of classes (names I choose often 
suck).  Internal code shamelessly ripped off from existing classes and 
munged/optimised.

There is now a ChunkedSymbolListFactory (i suspect it should be a builder 
but that name's taken) which takes a SymbolListFactory object.  
ChunkedSymbolListFactory has an addSymbols method just like the 
SeqIOListener one so you can wrap it in a class that implements that 
interface and take Symbols straight from your input source and chuck it 
into your new ChunkedSymbolList.  It also takes as an argument in its 
constructor a SymbolListFactory object that determines the object used to 
store the chunks.  I have created PackedSymbolListFactory and 
SimpleSymbolListFactory classes that can be passed to it which create 
chunks implemented by PackedSymbolLists and SimpleSymbolLists respectively.

Some optimisations have been done to improve PackedSymbolList performance.

When comparing chunked implementations backed by PackedSymbolLists and 
SimpleSymbolLists, the optimisations result in linear read speeds rising 
from 2.6->4.8 Mbase/s for PackedSymbolLists and 4.9->8.6 Mbase/s for 
SimpleSymbolLists.  This in on a P3/800 with 500MB RAM.  So there is still 
a penalty for using packed chunked lists (and plain symbol lists are even 
faster).  But if you should be operating on low RAM, the packed 
implementation will definitely win massively when you go into swap.  
Packing reduces object size 8-fold (32-bit to 4-bit) with ambiguity codes 
to 16-fold without.  You do realise that these numbers make it possible for 
us to run DAS reference servers for the human genome with all the symbols 
in core.  That'll just fly compared with any DB-backed implementation ;-).

Only DNA alphabet is currently supported (without gap symbols).

It probably can be improved further but that's all I have time for.  At some 
stage I will implement intelligent packing (ie. pack unambiguous sequence 
with unambiguous coding) which will get us more space savings under normal 
circumstances.  The scope for protein packing is less, around 3 fold so I'm 
not sure if it's worthwhile since proteins are relatively short.

Regards,
David Huen