[Biojava-l] Packed DNA Symbol List

David Huen smh1008@cus.cam.ac.uk
Mon, 11 Feb 2002 14:13:52 +0000 (GMT)


On Mon, 11 Feb 2002, Matthew Pocock wrote:

> Cool David! Have you got any stats about the relative performance of 
> the raw and packed implementations? The issue with AlphabetIndex and 
> ambiguities is my fault. I wrote the implementations not to index 
> ambiguities. What do you use an indexer for? I'm happy for you to commit 
> away. Thomas? Others?
> 
I'm using the indexer to convert symbols into 4-bit values that form the
array.
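
As a rough illustration of what a 4-bit value per symbol can look like, here
is a minimal sketch using one bit per base, so that an ambiguity symbol is the
OR of its member bases. The class name DnaNibbleCode, the use of plain chars
instead of BioJava Symbol objects, and the particular bit assignments are all
assumptions made for the example; this is not the AlphabetIndex API and not
necessarily the assignment used in PackedDNASymbolList.

```java
// Hypothetical sketch: maps IUPAC-style DNA characters to 4-bit codes.
// The bit assignments below are an assumption, not BioJava's.
class DnaNibbleCode {
    // One bit per base; every code fits in 4 bits (0..15).
    public static final int A = 1;
    public static final int C = 2;
    public static final int G = 4;
    public static final int T = 8;

    // Returns the 4-bit code for a base or for a few common ambiguity symbols.
    public static int encode(char base) {
        switch (Character.toLowerCase(base)) {
            case 'a': return A;
            case 'c': return C;
            case 'g': return G;
            case 't': return T;
            case 'r': return A | G;          // purine
            case 'y': return C | T;          // pyrimidine
            case 'n': return A | C | G | T;  // any base
            default:
                throw new IllegalArgumentException("unsupported symbol: " + base);
        }
    }

    // True if the ambiguity code 'code' includes the single base 'baseBit'.
    public static boolean matches(int code, int baseBit) {
        return (code & baseBit) != 0;
    }
}
```

With a mapping like this, ambiguities need no special casing in the packed
array, since every symbol, ambiguous or not, is just a value from 1 to 15.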

Test conditions: 200000 symbols on an Athlon MP (real 1200 MHz, whatever the
BogoHertz are).

For converting a SimpleSymbolList via the constructor into:
SimpleSymbolList    79 ms.
PackedDNASymbolList 29 ms. (why is this faster than the above????)

For reading through 200000 symbols sequentially:
SimpleSymbolList     4 ms.
PackedDNASymbolList 15 ms. (this is more expected but I expected it to
be even worse than this!).

I have tried reorganising the alphabet index to make the common symbols
come first but that seems to have a negligible impact on performance
compared to having a bit-to-base mapping.  I'm a bit surprised by this - I
suppose it just means that symbol lookup is not a major factor.  OTOH,
computing which element of the array to look up is a major factor:
replacing a division and a modulo with two bit ops doubled the performance.
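
The index computation described above can be sketched as follows, assuming
two 4-bit codes per byte. The class name PackedNibbles and its methods are
hypothetical and written only for this example; the actual
PackedDNASymbolList code may differ. The shift replaces i / 2 and the mask
replaces i % 2 (valid for non-negative i).

```java
// Hypothetical sketch: stores two 4-bit codes per byte in a byte array.
class PackedNibbles {
    private final byte[] data;
    private final int length;

    public PackedNibbles(int length) {
        this.length = length;
        this.data = new byte[(length + 1) >> 1]; // round up to whole bytes
    }

    public int length() {
        return length;
    }

    // Stores the low 4 bits of 'code' at position i.
    public void set(int i, int code) {
        int idx = i >> 1;          // replaces i / 2
        int shift = (i & 1) << 2;  // replaces (i % 2) * 4: 0 or 4
        int cleared = data[idx] & ~(0x0F << shift);
        data[idx] = (byte) (cleared | ((code & 0x0F) << shift));
    }

    // Returns the 4-bit code stored at position i.
    public int get(int i) {
        return (data[i >> 1] >> ((i & 1) << 2)) & 0x0F;
    }

    // Copy of the backing array, e.g. for handing to a database layer.
    public byte[] toByteArray() {
        return data.clone();
    }
}
```

The final mask with 0x0F in get() also discards the sign extension that Java
applies when a byte is widened to an int.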

There would be better performance if I could use an entity bigger than a
byte but JDBC drivers seem to like byte arrays and I'd like to be able to
export/import the associated byte arrays to databases readily.

Regards,
David Huen