[Biojava-l] reading nib sequence files

Mon Jan 24 03:52:46 EST 2005

BioJava already does some compression on large sequences (or at least 
it used to). As you say, bit packing can save a lot. Ambiguity causes 
problems, though: once you include ambiguity codes (n, y, r, etc.), DNA 
has more than four symbols, so two bits per base are no longer enough.
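
One common way around the ambiguity problem is to spend four bits per base instead of two, with one bit per unambiguous nucleotide, so that each ambiguity code is simply the OR of its members. The following is only a sketch of that idea and not BioJava's actual implementation; the class name `DnaMask`, the bit assignments, and the small subset of IUPAC codes handled are all assumptions made for illustration.

```java
final class DnaMask {
    // One bit per unambiguous base (the assignment is arbitrary, chosen for this sketch).
    static final int A = 1, C = 2, G = 4, T = 8;

    /** Returns the 4-bit mask for a nucleotide code (case-insensitive, subset of IUPAC). */
    static int mask(char code) {
        switch (Character.toLowerCase(code)) {
            case 'a': return A;
            case 'c': return C;
            case 'g': return G;
            case 't': return T;
            case 'r': return A | G;          // purine
            case 'y': return C | T;          // pyrimidine
            case 'n': return A | C | G | T;  // any base
            default:
                throw new IllegalArgumentException("unsupported code: " + code);
        }
    }

    /** True if the (possibly ambiguous) code could stand for the given unambiguous base. */
    static boolean matches(char code, char base) {
        int baseMask = mask(base);
        return (mask(code) & baseMask) == baseMask;
    }
}
```

Two such masks fit in one byte, so the cost of supporting ambiguity this way is a doubling of the packed size compared with a plain two-bit encoding.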

Does Jim Kent's scheme offer better compression? Even if it doesn't, the 
use of a ByteBuffer will probably increase the speed of the current 
implementations.

- Mark

"Richard HOLLAND" <hollandr at gis.a-star.edu.sg>
01/24/2005 04:47 PM

        To:     Mark Schreiber/GP/Novartis at PH, "Thomas Down" <td2 at sanger.ac.uk>
        cc:     "biojava-list List" <biojava-l at biojava.org>, <baggott2 at llnl.gov>
        Subject:        RE: [Biojava-l] reading nib sequence files

I think storing sequences internally in a compressed binary form
would be a good idea regardless, for any symbol list. Currently
each Symbol in a SymbolList requires one word of memory (the size of a
memory pointer to the singleton Symbol instances). Therefore any
SymbolList of length X containing symbols from an n-ary alphabet would
require X words of memory to store it, plus the overhead of the
SymbolList and n Symbol singleton instances (admittedly shared between
all SymbolLists currently in memory).

If you used a compressed binary format internally, doing away with
explicit Symbol references and representing each symbol in a ByteBuffer
as binary values (00 for A, 01 for T, 10 for C, 11 for G etc.), you
would require much less space than even the singleton model above. This
way you could fit four DNA symbols into a single byte of memory, as
opposed to four words of memory. The number of bits required for a
symbol in any given alphabet is simply log base 2 of the size of the
alphabet, rounded up to the next whole number. E.g. for the English
alphabet of 26 letters, you would need 5 bits per symbol, or in terms
of whole bytes, you would be able to fit 8 symbols into 5 bytes.
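
The arithmetic above can be checked with a few lines of Java. This is a small sketch only; the class and method names (`SymbolBits`, `bitsPerSymbol`, `packedBytes`) are invented for the example and are not part of BioJava.

```java
final class SymbolBits {
    /** Smallest number of bits that can distinguish alphabetSize symbols: ceil(log2(n)). */
    static int bitsPerSymbol(int alphabetSize) {
        if (alphabetSize < 1) {
            throw new IllegalArgumentException("alphabet must be non-empty");
        }
        // For n >= 1, 32 - numberOfLeadingZeros(n - 1) equals ceil(log2(n)),
        // and avoids floating-point rounding issues.
        return 32 - Integer.numberOfLeadingZeros(alphabetSize - 1);
    }

    /** Bytes needed to hold 'length' symbols packed back to back, rounded up to whole bytes. */
    static long packedBytes(long length, int alphabetSize) {
        return (length * bitsPerSymbol(alphabetSize) + 7) / 8;
    }
}
```

For example, `bitsPerSymbol(4)` is 2 and `bitsPerSymbol(26)` is 5, and `packedBytes(8, 26)` is 5, matching the figures above. Assuming 4-byte references (this depends on the JVM), a million-base DNA sequence would need about 4,000,000 bytes of Symbol references, against `packedBytes(1_000_000, 4)` = 250,000 bytes when packed.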

To do this you would need to define a 'bits' parameter on the alphabet
which is calculated from the number of symbols in the alphabet, a
'bitMap' parameter on the alphabet which maps symbols to bit values (and
vice versa with 'inverseBitMap'), and keep a separate 'length' parameter
in the SymbolList which would be used to tell the binary decoder when to
stop parsing the sequence (as you can only store whole bytes, there will
often be trailing zeroes in the buffer which could be misleading without
this extra parameter).
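
A rough sketch of how those pieces might fit together is shown below. It is not a proposal for the real BioJava API: the class `PackedSymbols` is hypothetical, symbols are plain `char`s rather than Symbol objects, and the alphabet is passed as a string whose character positions serve as the bit codes. The field names `bits`, `bitMap`, `inverseBitMap` and `length` follow the parameters described above.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

final class PackedSymbols {
    private final int bits;                                   // bits per symbol
    private final Map<Character, Integer> bitMap = new HashMap<>(); // symbol -> code
    private final char[] inverseBitMap;                       // code -> symbol
    private final int length;                                 // symbols actually stored
    private final ByteBuffer buffer;

    PackedSymbols(String alphabet, String sequence) {
        inverseBitMap = alphabet.toCharArray();
        bits = 32 - Integer.numberOfLeadingZeros(inverseBitMap.length - 1);
        for (int code = 0; code < inverseBitMap.length; code++) {
            bitMap.put(inverseBitMap[code], code);
        }
        length = sequence.length();
        buffer = ByteBuffer.allocate((int) (((long) length * bits + 7) / 8));
        for (int i = 0; i < length; i++) {
            Integer code = bitMap.get(sequence.charAt(i));
            if (code == null) {
                throw new IllegalArgumentException("not in alphabet: " + sequence.charAt(i));
            }
            // Write the code one bit at a time, most significant bit first.
            for (int b = 0; b < bits; b++) {
                if (((code >> (bits - 1 - b)) & 1) != 0) {
                    long pos = (long) i * bits + b;
                    int byteIndex = (int) (pos >>> 3);
                    int bitInByte = (int) (pos & 7);
                    buffer.put(byteIndex, (byte) (buffer.get(byteIndex) | (0x80 >>> bitInByte)));
                }
            }
        }
    }

    /** Decodes the symbol at the given index on the fly. */
    char symbolAt(int index) {
        // The length check is what stops trailing padding bits being read as symbols.
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        int code = 0;
        for (int b = 0; b < bits; b++) {
            long pos = (long) index * bits + b;
            int bit = (buffer.get((int) (pos >>> 3)) >> (7 - (int) (pos & 7))) & 1;
            code = (code << 1) | bit;
        }
        return inverseBitMap[code];
    }

    int length() { return length; }

    int byteCount() { return buffer.capacity(); }
}
```

With the alphabet string `"ATCG"` (so A=00, T=01, C=10, G=11), the seven-symbol sequence `"GATTACA"` occupies two bytes. The last two bits of the second byte are padding; without the stored length they would decode as an eighth symbol, A.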

You could always return singleton Symbol objects if requested, by
decoding the binary sequence on the fly, but you would no longer need to
store the sequence using them.

Is this worth considering for the big BioJava rewrite?

Richard Holland
Bioinformatics Specialist
GIS extension 8199 

> -----Original Message-----
> From: mark.schreiber at group.novartis.com 
> [mailto:mark.schreiber at group.novartis.com] 
> Sent: Monday, January 24, 2005 4:37 PM
> To: Thomas Down
> Cc: biojava-list List; Richard HOLLAND; 
> "<baggott2 at llnl.gov"@novartis.com
> Subject: Re: [Biojava-l] reading nib sequence files
> 
> 
> I'd need to brush up on my nio, and my c !
> 
> 
> 
> 
> 
> Thomas Down <td2 at sanger.ac.uk>
> 01/24/2005 04:34 PM
> 
> 
>         To:     "Richard HOLLAND" <hollandr at gis.a-star.edu.sg>
>         cc:     "<baggott2 at llnl.gov>", biojava-list List <biojava-l at biojava.org>,
>                 Mark Schreiber/GP/Novartis at PH
>         Subject:        Re: [Biojava-l] reading nib sequence files
> 
> 
> 
> On 24 Jan 2005, at 02:48, Richard HOLLAND wrote:
> 
> > It's a compressed binary format. I doubt BioJava would be able to read
> > it without a lot of effort as the current parser framework is set up
> > for text input only.
> 
> Nib support probably wouldn't fit into the text-oriented parsing 
> framework, but I'm sure it could be supported somehow if there was 
> demand.  A quick google doesn't turn up any format documentation, but 
> Jim Kent's IO code is at:
> 
>            http://www.soe.ucsc.edu/~kent/src/unzipped/lib/nib.c
> 
> One interesting way to handle this might be to open the nib file as a 
> MappedByteBuffer, and back a SymbolList directly using that -- 
> potentially giving us an efficient way of working with huge 
> sequences.  Any interest in that?
> 
>            Thomas.
> 
> 
> 
> 
>
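
Thomas's suggestion of backing a sequence directly with a MappedByteBuffer might look roughly like the sketch below. Note the assumptions: the file layout here (four bases per byte, two bits each, first base in the top two bits, no header) is made up for the example and is not the nib format, whose layout would have to be taken from nib.c. The class name `MappedPackedDna` and the code table are likewise hypothetical, and the caller is assumed to know the sequence length.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class MappedPackedDna {
    // Hypothetical 2-bit codes 00, 01, 10, 11 (not the nib encoding).
    private static final char[] BASES = {'A', 'T', 'C', 'G'};

    private final MappedByteBuffer buffer;
    private final long length; // number of bases, supplied by the caller

    MappedPackedDna(Path file, long length) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            if (length < 0 || length > channel.size() * 4) {
                throw new IllegalArgumentException("length does not fit the file");
            }
            // The mapping stays valid after the channel is closed.
            buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        }
        this.length = length;
    }

    /** Reads one base straight from the mapped file, without loading the whole sequence. */
    char baseAt(long index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        int b = buffer.get((int) (index >>> 2));   // four bases per byte
        int shift = 6 - 2 * (int) (index & 3);     // first base sits in the top two bits
        return BASES[(b >> shift) & 3];
    }

    long length() { return length; }
}
```

Only the pages actually touched are read from disk, which is the attraction for huge sequences. A single mapping cannot exceed Integer.MAX_VALUE bytes, so a very large file would need several mapped regions.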