[Biojava-l] Compress Sequences.

mark.schreiber at novartis.com mark.schreiber at novartis.com
Fri Aug 12 02:45:51 EDT 2005


Check out PackedSymbolList and its associated classes and interfaces:
PackedSymbolListFactory, Packing, and PackingFactory. These do bit
packing of sequences. The nice part is that they behave exactly like
normal SymbolLists, so you don't even know you're dealing with a
compressed sequence.

From the Javadocs:

Example Usage
 SymbolList symL = ...;
 SymbolList packed = new PackedSymbolList(
   PackingFactory.getPacking(symL.getAlphabet(), true),
   symL
 );
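
To illustrate the idea behind bit packing (four nucleotides per byte), 
here is a minimal sketch in plain Java. It is not BioJava's 
implementation: the class TwoBitDna and its methods are hypothetical 
names made up for this example, and it assumes unambiguous DNA 
(a, c, g, t only).

```java
/** Hypothetical helper, not part of BioJava: stores DNA at 2 bits per base. */
final class TwoBitDna {
    private static final char[] BASES = {'a', 'c', 'g', 't'};

    /** Maps a base to its 2-bit code; anything other than a/c/g/t is rejected. */
    private static int code(char base) {
        switch (Character.toLowerCase(base)) {
            case 'a': return 0;
            case 'c': return 1;
            case 'g': return 2;
            case 't': return 3;
            default:
                throw new IllegalArgumentException("Not an unambiguous base: " + base);
        }
    }

    /** Packs four bases per byte, with the first base in the low-order bits. */
    static byte[] pack(String seq) {
        byte[] packed = new byte[(seq.length() + 3) / 4];
        for (int i = 0; i < seq.length(); i++) {
            packed[i / 4] |= (byte) (code(seq.charAt(i)) << ((i % 4) * 2));
        }
        return packed;
    }

    /** Reads the base at a 0-based position without unpacking everything. */
    static char baseAt(byte[] packed, int index) {
        return BASES[(packed[index / 4] >> ((index % 4) * 2)) & 0x3];
    }

    /** Unpacks the first 'length' bases; the length is not stored in the array. */
    static String unpack(byte[] packed, int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(baseAt(packed, i));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String seq = "gattaca";
        byte[] packed = pack(seq);
        System.out.println(seq.length() + " bases in " + packed.length + " bytes");
        System.out.println(unpack(packed, seq.length()));
    }
}
```

Two packed sequences of the same length can be compared byte by byte 
(for example with java.util.Arrays.equals), which touches a quarter as 
many elements as comparing one char per base.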


It is also relatively straightforward to write a Huffman tree generator 
that can compress a SymbolList into a binary string. You could use this 
as the basis for full LZ compression. There are also much more 
complicated published algorithms that look for long-range repeats, but 
these are very slow.
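
A rough sketch of such a Huffman code generator follows, in plain Java. 
It works on the characters of a String rather than on BioJava Symbols, 
and the class HuffmanSketch and its methods are hypothetical names used 
only for illustration. The bits are kept as a String of '0' and '1' for 
readability; a real implementation would write them into a byte array.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.TreeMap;

/** Hypothetical sketch, not BioJava API: Huffman codes for a sequence's characters. */
final class HuffmanSketch {
    /** Tree node; leaves carry a symbol, internal nodes carry two children. */
    private static final class Node implements Comparable<Node> {
        final long weight;
        final char symbol;
        final Node left, right;
        Node(long weight, char symbol, Node left, Node right) {
            this.weight = weight; this.symbol = symbol;
            this.left = left; this.right = right;
        }
        boolean isLeaf() { return left == null; }
        public int compareTo(Node o) { return Long.compare(weight, o.weight); }
    }

    /** Counts symbol frequencies, builds the tree, and returns symbol -> bit string. */
    static Map<Character, String> buildCodes(String seq) {
        Map<Character, Long> freq = new TreeMap<>();
        for (char c : seq.toCharArray()) freq.merge(c, 1L, Long::sum);
        PriorityQueue<Node> queue = new PriorityQueue<>();
        for (Map.Entry<Character, Long> e : freq.entrySet()) {
            queue.add(new Node(e.getValue(), e.getKey(), null, null));
        }
        Map<Character, String> codes = new TreeMap<>();
        if (queue.isEmpty()) return codes;
        // Repeatedly merge the two lightest subtrees until one tree remains.
        while (queue.size() > 1) {
            Node a = queue.poll();
            Node b = queue.poll();
            queue.add(new Node(a.weight + b.weight, '\0', a, b));
        }
        assign(queue.poll(), "", codes);
        return codes;
    }

    private static void assign(Node n, String prefix, Map<Character, String> codes) {
        if (n.isLeaf()) {
            // A sequence with a single distinct symbol still needs one bit per symbol.
            codes.put(n.symbol, prefix.isEmpty() ? "0" : prefix);
            return;
        }
        assign(n.left, prefix + "0", codes);
        assign(n.right, prefix + "1", codes);
    }

    static String encode(String seq, Map<Character, String> codes) {
        StringBuilder bits = new StringBuilder();
        for (char c : seq.toCharArray()) bits.append(codes.get(c));
        return bits.toString();
    }

    /** Decodes greedily; this works because Huffman codes are prefix-free. */
    static String decode(String bits, Map<Character, String> codes) {
        Map<String, Character> inverse = new HashMap<>();
        for (Map.Entry<Character, String> e : codes.entrySet()) {
            inverse.put(e.getValue(), e.getKey());
        }
        StringBuilder out = new StringBuilder();
        StringBuilder current = new StringBuilder();
        for (char b : bits.toCharArray()) {
            current.append(b);
            Character symbol = inverse.get(current.toString());
            if (symbol != null) {
                out.append(symbol);
                current.setLength(0);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String seq = "aaaaaaaacccgt";
        Map<Character, String> codes = buildCodes(seq);
        String bits = encode(seq, codes);
        System.out.println(codes + " -> " + bits.length() + " bits");
        System.out.println(decode(bits, codes));
    }
}
```

Huffman coding only beats fixed 2-bit packing when the base composition 
is skewed; for a roughly uniform DNA sequence every code comes out at 
2 bits and nothing is gained.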

- Mark





Felipe Albrecht <felipe.albrecht at gmail.com>
Sent by: biojava-l-bounces at portal.open-bio.org
08/12/2005 04:07 AM

 
        To:     biojava-l at biojava.org
        cc:     (bcc: Mark Schreiber/GP/Novartis)
        Subject:        [Biojava-l] Compress Sequences.


Is there a class in BioJava that compresses sequences?
For example, one that puts four nucleotides in a single byte.

If one doesn't exist, does anyone know a good algorithm for compressing,
reading, and comparing such sequences?

Thanks.


Felipe Albrecht

_______________________________________________
Biojava-l mailing list  -  Biojava-l at biojava.org
http://biojava.org/mailman/listinfo/biojava-l




