[Biojava-l] reading nib sequence files

Mon Jan 24 04:17:16 EST 2005

BioJava uses (or at least can use) the PackedSymbolList for large 
sequences. It uses an array of longs to represent the packed bits.

There may be some advantage to using a ByteBuffer instead, but it is
hard to know.

- Mark

"Richard HOLLAND" <hollandr at gis.a-star.edu.sg>
01/24/2005 04:59 PM

        To:     Mark Schreiber/GP/Novartis at PH
        cc:     <baggott2 at llnl.gov>, "biojava-list List" <biojava-l at biojava.org>, "Thomas 
Down" <td2 at sanger.ac.uk>
        Subject:        RE: [Biojava-l] reading nib sequence files

NIB files store each base in a fixed 4 bits, which halves the size
relative to one byte per base and allows at most 16 different base
values per position.
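
The arithmetic above can be sketched roughly as follows. This is an illustration only, not Jim Kent's code: the class name NibblePacking and the high-nibble-first ordering are assumptions made for the example.

```java
// Hypothetical sketch: fixed-width 4-bit codes, two per byte.
// High-nibble-first ordering is an assumption made for illustration.
final class NibblePacking {

    /** Number of bytes needed to hold n 4-bit codes (half of n, rounded up). */
    static int packedSize(int n) {
        return (n + 1) / 2;
    }

    /** Packs codes in the range 0..15, two per byte. */
    static byte[] pack(int[] codes) {
        byte[] out = new byte[packedSize(codes.length)];
        for (int i = 0; i < codes.length; i++) {
            int code = codes[i] & 0x0F;             // keep only the low 4 bits
            if ((i & 1) == 0) {
                out[i / 2] |= (byte) (code << 4);   // even index: high nibble
            } else {
                out[i / 2] |= (byte) code;          // odd index: low nibble
            }
        }
        return out;
    }

    /** Reads back the 4-bit code stored at position i. */
    static int get(byte[] packed, int i) {
        int b = packed[i / 2] & 0xFF;               // treat the byte as unsigned
        return (i & 1) == 0 ? (b >>> 4) : (b & 0x0F);
    }
}
```

With four bits per position the code space is 0..15, which is where the limit of 16 values comes from; an odd-length sequence leaves one unused nibble in the last byte.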

Richard Holland
Bioinformatics Specialist
GIS extension 8199 

---------------------------------------------
This email is confidential and may be privileged. If you are not the
intended recipient, please delete it and notify us immediately. Please
do not copy or use it for any purpose, or disclose its content to any
other person. Thank you.
---------------------------------------------

> -----Original Message-----
> From: mark.schreiber at group.novartis.com 
> [mailto:mark.schreiber at group.novartis.com] 
> Sent: Monday, January 24, 2005 4:53 PM
> To: Richard HOLLAND
> Cc: baggott2 at llnl.gov; biojava-list List; Thomas Down
> Subject: RE: [Biojava-l] reading nib sequence files
> 
> 
> BioJava already does some compression on large sequences (or at least
> it used to). As you say, you can bit-pack a lot. Ambiguity causes
> problems, as DNA can have more than four symbols (including n, y, r,
> etc.).
> 
> Does Jim Kent's scheme offer better compression? Even if it doesn't,
> the use of a ByteBuffer will probably increase the speed of the
> current implementations.
> 
> - Mark
> 
> 
> 
> 
> 
> "Richard HOLLAND" <hollandr at gis.a-star.edu.sg>
> 01/24/2005 04:47 PM
> 
> 
>         To:     Mark Schreiber/GP/Novartis at PH, "Thomas Down" 
> <td2 at sanger.ac.uk>
>         cc:     "biojava-list List" <biojava-l at biojava.org>, 
> <baggott2 at llnl.gov>
>         Subject:        RE: [Biojava-l] reading nib sequence files
> 
> 
> I think storing sequences internally as compressed binary would be a
> good idea regardless, for any symbol list. Currently each Symbol in a
> SymbolList requires one word of memory (the size of a memory pointer
> to the singleton Symbol instances). Therefore any SymbolList of length
> X containing symbols from an n-ary alphabet would require X words of
> memory to store it, plus the overhead of the SymbolList and n Symbol
> singleton instances (admittedly shared between all SymbolLists
> currently in memory).
> 
> If you used a compressed binary format internally, doing away with
> explicit Symbol references and representing each symbol in a
> ByteBuffer as binary values (00 for A, 01 for T, 10 for C, 11 for G,
> etc.), you would require much less space than even the singleton model
> above. This way you could fit four DNA symbols into a single byte of
> memory, as opposed to four words of memory. The number of bits
> required for a symbol in any given alphabet is merely log base 2 of
> the size of the alphabet, rounded up to the nearest whole number. E.g.
> for the English alphabet of 26 letters only, you would need 5 bits, or
> in terms of whole bytes, you would be able to fit 8 symbols into 5
> bytes.
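
A minimal sketch of the two ideas in the paragraph above: the bits-per-symbol calculation and 2-bit packing of unambiguous DNA into a ByteBuffer. The class and method names (TwoBitDna, bitsFor, pack, charAt) are hypothetical and are not BioJava API; the code table follows the 00/01/10/11 assignment given in the message, and placing the first symbol in the top two bits of each byte is an assumption.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch, not BioJava API.
final class TwoBitDna {
    // Index in this string is the 2-bit code: A=00, T=01, C=10, G=11.
    private static final String CODES = "ATCG";

    /** ceil(log2(alphabetSize)) for alphabetSize >= 1. */
    static int bitsFor(int alphabetSize) {
        return 32 - Integer.numberOfLeadingZeros(alphabetSize - 1);
    }

    /** Packs an unambiguous DNA string, four symbols per byte. */
    static ByteBuffer pack(String dna) {
        ByteBuffer buf = ByteBuffer.allocate((dna.length() + 3) / 4);
        for (int i = 0; i < dna.length(); i++) {
            int code = CODES.indexOf(dna.charAt(i));
            if (code < 0) {
                throw new IllegalArgumentException("Not A/T/C/G: " + dna.charAt(i));
            }
            int shift = 6 - 2 * (i % 4);        // first symbol goes in the top two bits
            int idx = i / 4;
            buf.put(idx, (byte) (buf.get(idx) | (code << shift)));
        }
        return buf;
    }

    /** Decodes the symbol at position i; the caller must track the true length. */
    static char charAt(ByteBuffer buf, int i) {
        int shift = 6 - 2 * (i % 4);
        return CODES.charAt((buf.get(i / 4) >>> shift) & 0x3);
    }
}
```

Here bitsFor(4) gives 2 and bitsFor(26) gives 5, matching the figures quoted above. Ambiguity symbols are rejected, since they do not fit in a four-symbol code.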
> 
> To do this you would need to define a 'bits' parameter on the alphabet
> which is calculated from the number of symbols in the alphabet, a
> 'bitMap' parameter on the alphabet which maps symbols to bit values
> (and vice versa with 'inverseBitMap'), and keep a separate 'length'
> parameter in the SymbolList which would be used to tell the binary
> decoder when to stop parsing the sequence (as you can only store whole
> bytes, there will often be trailing zeroes in the buffer which could
> be misleading without this extra parameter).
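
The proposal above might look something like the following sketch, which uses plain chars in place of BioJava Symbol objects. The class PackedText and its methods are hypothetical; only the field names bits, bitMap, inverseBitMap and length are taken from the message. Symbols are decoded on request, and the stored length is what stops the decoder from reading the padding bits in the last byte.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: chars stand in for BioJava Symbol singletons.
final class PackedText {
    private final int bits;                                   // width of one symbol code
    private final Map<Character, Integer> bitMap = new HashMap<>(); // symbol -> code
    private final char[] inverseBitMap;                       // code -> symbol
    private final byte[] data;                                // packed codes, MSB first
    private final int length;                                 // number of symbols stored

    PackedText(String alphabet, String text) {
        bits = Math.max(1, 32 - Integer.numberOfLeadingZeros(alphabet.length() - 1));
        inverseBitMap = alphabet.toCharArray();
        for (int i = 0; i < inverseBitMap.length; i++) {
            bitMap.put(inverseBitMap[i], i);
        }
        length = text.length();
        data = new byte[(length * bits + 7) / 8];             // round up to whole bytes
        for (int i = 0; i < length; i++) {
            Integer code = bitMap.get(text.charAt(i));
            if (code == null) {
                throw new IllegalArgumentException("Not in alphabet: " + text.charAt(i));
            }
            for (int b = 0; b < bits; b++) {
                if (((code >> (bits - 1 - b)) & 1) != 0) {
                    int p = i * bits + b;                     // absolute bit position
                    data[p >> 3] |= (byte) (0x80 >> (p & 7));
                }
            }
        }
    }

    /** Decodes one symbol on the fly; positions at or past length are rejected. */
    char symbolAt(int i) {
        if (i < 0 || i >= length) {
            throw new IndexOutOfBoundsException("index " + i);
        }
        int code = 0;
        for (int b = 0; b < bits; b++) {
            int p = i * bits + b;
            code = (code << 1) | ((data[p >> 3] >> (7 - (p & 7))) & 1);
        }
        return inverseBitMap[code];
    }

    int length() { return length; }

    int byteCount() { return data.length; }
}
```

For a 26-letter alphabet this stores an 8-symbol text in 5 bytes, as in the example above. The int bit positions limit the sketch to fairly short sequences; a real implementation would need long arithmetic.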
> 
> You could always return singleton Symbol objects if requested, by
> decoding the binary sequence on the fly, but you would no longer need
> to store the sequence using them.
> 
> Is this worth considering for the big BioJava rewrite?
> 
> Richard Holland
> Bioinformatics Specialist
> GIS extension 8199 
> 
> 
> 
> > -----Original Message-----
> > From: mark.schreiber at group.novartis.com 
> > [mailto:mark.schreiber at group.novartis.com] 
> > Sent: Monday, January 24, 2005 4:37 PM
> > To: Thomas Down
> > Cc: biojava-list List; Richard HOLLAND; 
> > "<baggott2 at llnl.gov"@novartis.com
> > Subject: Re: [Biojava-l] reading nib sequence files
> > 
> > 
> > I'd need to brush up on my NIO, and my C!
> > 
> > 
> > 
> > 
> > 
> > Thomas Down <td2 at sanger.ac.uk>
> > 01/24/2005 04:34 PM
> > 
> > 
> >         To:     "Richard HOLLAND" <hollandr at gis.a-star.edu.sg>
> >         cc:     "<baggott2 at llnl.gov>", biojava-list List 
> > <biojava-l at biojava.org>, Mark 
> > Schreiber/GP/Novartis at PH
> >         Subject:        Re: [Biojava-l] reading nib sequence files
> > 
> > 
> > 
> > On 24 Jan 2005, at 02:48, Richard HOLLAND wrote:
> > 
> > > It's a compressed binary format. I doubt BioJava would be able to
> > > read it without a lot of effort as the current parser framework is
> > > set up for text input only.
> > 
> > Nib support probably wouldn't fit into the text-oriented parsing
> > framework, but I'm sure it could be supported somehow if there was
> > demand.  A quick Google search doesn't turn up any format
> > documentation, but Jim Kent's IO code is at:
> > 
> >            http://www.soe.ucsc.edu/~kent/src/unzipped/lib/nib.c
> > 
> > One interesting way to handle this might be to open the nib file as
> > a MappedByteBuffer, and back a SymbolList directly using that --
> > potentially giving us an efficient way of working with huge
> > sequences. Any interest in that?
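
A rough sketch of the memory-mapped approach suggested above, using java.nio. The class MappedNibbleSequence is hypothetical and returns chars, not BioJava Symbols. The file layout in the comments is an assumption for illustration only and would need to be checked against nib.c, as would the byte order.

```java
import java.io.IOException;
import java.nio.ByteOrder;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch. ASSUMED layout (verify against nib.c): a 4-byte
// signature, a 4-byte base count, then two bases per byte with the first
// base in the high nibble.
final class MappedNibbleSequence {
    private static final int HEADER_BYTES = 8;       // assumed header size
    private static final String LETTERS = "TCAGN";   // assumed code order 0..4

    private final MappedByteBuffer buf;
    private final int length;

    private MappedNibbleSequence(MappedByteBuffer buf, int length) {
        this.buf = buf;
        this.length = length;
    }

    /** Maps the whole file read-only; a single mapping is limited to 2 GB. */
    static MappedNibbleSequence open(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            buf.order(ByteOrder.LITTLE_ENDIAN);       // assumption: little-endian writer
            // The mapping stays valid after the channel is closed.
            return new MappedNibbleSequence(buf, buf.getInt(4));
        }
    }

    int length() { return length; }

    /** Decodes one base directly from the mapped file, with no copy of the sequence. */
    char charAt(int i) {
        if (i < 0 || i >= length) {
            throw new IndexOutOfBoundsException("index " + i);
        }
        int b = buf.get(HEADER_BYTES + i / 2) & 0xFF;
        int code = (i % 2 == 0) ? (b >>> 4) : (b & 0x0F);
        boolean masked = (code & 0x8) != 0;           // assumption: bit 3 marks lowercase
        int base = code & 0x7;
        char c = base < LETTERS.length() ? LETTERS.charAt(base) : '?';
        return masked ? Character.toLowerCase(c) : c;
    }
}
```

Because the operating system pages the file in on demand, only the regions actually read occupy memory, which is what makes this attractive for very large sequences.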
> > 
> >            Thomas.
> > 