[Biojava-l] reading nib sequence files

Mon Jan 24 04:19:07 EST 2005

The trouble with ZIP is that to do random-access reads of the sequence
(e.g. give me all bases from X to Y) you have to decompress the stream
from the start each time, which makes it quite a bit slower. What is
needed is a compression scheme that allows direct random access without
slowing down the create/update process too much either. Hence a custom
fixed-width binary solution would be the first thing that comes to
mind, but it may not be the only one.
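
To make the fixed-width idea concrete, here is a minimal sketch,
assuming unambiguous DNA only and the 2-bit codes mentioned further
down the thread (00=A, 01=T, 10=C, 11=G). PackedDna is a made-up name,
not a BioJava class, and ambiguity symbols (n, y, r, ...) are not
handled. The point is only that base i sits at a computable offset, so
a range read touches just the bytes that cover it.

```java
// Sketch only: PackedDna is a hypothetical helper, not part of BioJava.
final class PackedDna {
    // Code table from this thread: 00=A, 01=T, 10=C, 11=G.
    private static final String CODES = "ATCG";

    /** Packs an unambiguous DNA string, four bases per byte. */
    static byte[] pack(String dna) {
        byte[] out = new byte[(dna.length() + 3) / 4];
        for (int i = 0; i < dna.length(); i++) {
            int code = CODES.indexOf(dna.charAt(i));
            if (code < 0) {
                throw new IllegalArgumentException("not A/T/C/G: " + dna.charAt(i));
            }
            // Base i lives in byte i/4, at bit offset 2*(i%4) from the low end.
            out[i >> 2] |= (byte) (code << ((i & 3) * 2));
        }
        return out;
    }

    /** Returns bases [from, to) by reading only the bytes that cover them. */
    static String range(byte[] packed, int from, int to) {
        StringBuilder sb = new StringBuilder(Math.max(0, to - from));
        for (int i = from; i < to; i++) {
            int code = (packed[i >> 2] >> ((i & 3) * 2)) & 3;
            sb.append(CODES.charAt(code));
        }
        return sb.toString();
    }
}
```

For example, PackedDna.range(PackedDna.pack("ACGTTGCA"), 2, 6) gives
"GTTG". The caller still has to remember the true sequence length,
because the padding bits in the last byte decode as A.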

Richard Holland
Bioinformatics Specialist
GIS extension 8199   

---------------------------------------------
This email is confidential and may be privileged. If you are not the
intended recipient, please delete it and notify us immediately. Please
do not copy or use it for any purpose, or disclose its content to any
other person. Thank you.
---------------------------------------------

> -----Original Message-----
> From: VERHOEF Frans 
> Sent: Monday, January 24, 2005 5:16 PM
> To: Richard HOLLAND; mark.schreiber at group.novartis.com
> Cc: Thomas Down; biojava-list List
> Subject: RE: [Biojava-l] reading nib sequence files
> 
> 
> You could always ZIPStream it out for even more compression.
> 
> Frans
> 
> -----Original Message-----
> From: biojava-l-bounces at portal.open-bio.org 
> [mailto:biojava-l-bounces at portal.open-bio.org] On Behalf Of 
> Richard HOLLAND
> Sent: Monday, January 24, 2005 04:59 PM
> To: mark.schreiber at group.novartis.com
> Cc: Thomas Down; biojava-list List
> Subject: RE: [Biojava-l] reading nib sequence files
> 
> NIB files store each base in a fixed 4 bits, giving 50% compression
> relative to one byte per base and allowing up to 16 different base
> values per position.
> 
> Richard Holland
> Bioinformatics Specialist
> GIS extension 8199   
>  
> 
> 
> > -----Original Message-----
> > From: mark.schreiber at group.novartis.com
> > [mailto:mark.schreiber at group.novartis.com] 
> > Sent: Monday, January 24, 2005 4:53 PM
> > To: Richard HOLLAND
> > Cc: baggott2 at llnl.gov; biojava-list List; Thomas Down
> > Subject: RE: [Biojava-l] reading nib sequence files
> > 
> > 
> > BioJava does already do some compression on large sequences (or at
> > least it used to). Like you say, you can bit-pack a lot. Ambiguity
> > causes problems, as you can have more than four symbols for DNA
> > (including n, y, r, etc.).
> > 
> > Does Jim Kent's schema offer better compression? Even if it doesn't,
> > the use of a ByteBuffer will probably increase the speed of the
> > current implementations.
> > 
> > - Mark
> > 
> > 
> > 
> > 
> > 
> > "Richard HOLLAND" <hollandr at gis.a-star.edu.sg>
> > 01/24/2005 04:47 PM
> > 
> >  
> >         To:     Mark Schreiber/GP/Novartis at PH, "Thomas Down" 
> > <td2 at sanger.ac.uk>
> >         cc:     "biojava-list List" <biojava-l at biojava.org>, 
> > <baggott2 at llnl.gov>
> >         Subject:        RE: [Biojava-l] reading nib sequence files
> > 
> > 
> > I think storing sequences internally as compressed binary would be a
> > good idea regardless, for any symbol list. Currently each Symbol in a
> > SymbolList requires one word of memory (the size of a memory pointer
> > to the singleton Symbol instances). Therefore any SymbolList of
> > length X containing symbols from an n-ary alphabet requires X words
> > of memory, plus the overhead of the SymbolList and the n singleton
> > Symbol instances (admittedly shared between all SymbolLists currently
> > in memory).
> > 
> > If you used a compressed binary format internally, doing away with
> > explicit Symbol references and representing each symbol in a
> > ByteBuffer as binary values (00 for A, 01 for T, 10 for C, 11 for G,
> > etc.), you would require much less space than even the singleton
> > model above. This way you could fit four DNA symbols into a single
> > byte of memory, as opposed to four words of memory. The number of
> > bits required for a symbol in any given alphabet is merely log base 2
> > of the size of the alphabet, rounded up to the nearest whole number.
> > E.g. for the English alphabet of 26 letters only, you would need 5
> > bits, or in terms of whole bytes, you would be able to fit 8 symbols
> > into 5 bytes.
> > 
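The bit-count arithmetic above can be checked with a few lines of Java.
This is only a sketch; SymbolBits and its method names are invented for
illustration and do not exist in BioJava.

```java
// Sketch only: SymbolBits is a hypothetical helper, not part of BioJava.
final class SymbolBits {
    /** Smallest number of bits that can distinguish alphabetSize symbols. */
    static int bitsPerSymbol(int alphabetSize) {
        if (alphabetSize < 1) {
            throw new IllegalArgumentException("alphabet must not be empty");
        }
        // ceil(log2(n)) in integer arithmetic; a one-symbol alphabet needs 0 bits.
        return 32 - Integer.numberOfLeadingZeros(alphabetSize - 1);
    }

    /** Whole bytes needed to hold 'length' symbols of the given bit width. */
    static int bytesNeeded(int length, int bits) {
        return (int) (((long) length * bits + 7) / 8);
    }
}
```

With these, bitsPerSymbol(4) is 2 for DNA, bitsPerSymbol(26) is 5 for
the English alphabet, and bytesNeeded(8, 5) is 5, matching the figures
quoted above.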
> > To do this you would need to define a 'bits' parameter on the
> > alphabet which is calculated from the number of symbols in the
> > alphabet, a 'bitMap' parameter on the alphabet which maps symbols to
> > bit values (and vice versa with 'inverseBitMap'), and keep a separate
> > 'length' parameter in the SymbolList which would be used to tell the
> > binary decoder when to stop parsing the sequence (as you can only
> > store whole bytes, there will often be trailing zeroes in the buffer
> > which could be misleading without this extra parameter).
> > 
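A rough sketch of how 'bits', 'bitMap', 'inverseBitMap' and 'length'
might fit together, assuming symbols are plain chars and the alphabet
is given as a string. PackedSymbols is an invented class, nothing here
is existing BioJava API, and a real version would work on Symbol
objects rather than chars.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: PackedSymbols and its members are hypothetical names that
// follow the parameters proposed above; this is not BioJava code.
final class PackedSymbols {
    private final int bits;                 // bits per symbol
    private final Map<Character, Integer> bitMap = new HashMap<>();
    private final char[] inverseBitMap;     // code -> symbol
    private final byte[] data;              // packed codes, zero padded at the end
    private final int length;               // number of symbols actually stored

    PackedSymbols(String alphabet, String sequence) {
        inverseBitMap = alphabet.toCharArray();
        // ceil(log2(alphabet size)), but never fewer than 1 bit per symbol.
        bits = Math.max(1, 32 - Integer.numberOfLeadingZeros(alphabet.length() - 1));
        for (int code = 0; code < inverseBitMap.length; code++) {
            bitMap.put(inverseBitMap[code], code);
        }
        length = sequence.length();
        data = new byte[(int) (((long) length * bits + 7) / 8)];
        for (int i = 0; i < length; i++) {
            Integer code = bitMap.get(sequence.charAt(i));
            if (code == null) {
                throw new IllegalArgumentException("not in alphabet: " + sequence.charAt(i));
            }
            // Write the code one bit at a time, most significant bit first.
            for (int b = 0; b < bits; b++) {
                if (((code >> (bits - 1 - b)) & 1) != 0) {
                    long pos = (long) i * bits + b;
                    data[(int) (pos >>> 3)] |= (byte) (0x80 >> (int) (pos & 7));
                }
            }
        }
    }

    int length() { return length; }

    int byteCount() { return data.length; }

    /** Decodes symbol i on the fly; 'length' guards against reading padding. */
    char symbolAt(int i) {
        if (i < 0 || i >= length) {
            throw new IndexOutOfBoundsException("index " + i);
        }
        int code = 0;
        for (int b = 0; b < bits; b++) {
            long pos = (long) i * bits + b;
            int bit = (data[(int) (pos >>> 3)] >> (7 - (int) (pos & 7))) & 1;
            code = (code << 1) | bit;
        }
        return inverseBitMap[code];
    }
}
```

Writing bit by bit is slow but keeps the sketch short and correct for
any width; a real implementation would presumably shift whole codes
into place.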
> > You could always return singleton Symbol objects if requested, by
> > decoding the binary sequence on the fly, but you would no longer need
> > to store the sequence using them.
> > 
> > Is this worth considering for the big BioJava rewrite?
> > 
> > Richard Holland
> > Bioinformatics Specialist
> > GIS extension 8199
> >  
> > 
> > 
> > > -----Original Message-----
> > > From: mark.schreiber at group.novartis.com
> > > [mailto:mark.schreiber at group.novartis.com] 
> > > Sent: Monday, January 24, 2005 4:37 PM
> > > To: Thomas Down
> > > Cc: biojava-list List; Richard HOLLAND; 
> > > "<baggott2 at llnl.gov"@novartis.com
> > > Subject: Re: [Biojava-l] reading nib sequence files
> > > 
> > > 
> > > I'd need to brush up on my NIO, and my C!
> > > 
> > > 
> > > 
> > > 
> > > 
> > > Thomas Down <td2 at sanger.ac.uk>
> > > 01/24/2005 04:34 PM
> > > 
> > > 
> > >         To:     "Richard HOLLAND" <hollandr at gis.a-star.edu.sg>
> > >         cc:     "<baggott2 at llnl.gov>", biojava-list List 
> > > <biojava-l at biojava.org>, Mark
> > > Schreiber/GP/Novartis at PH
> > >         Subject:        Re: [Biojava-l] reading nib sequence files
> > > 
> > > 
> > > 
> > > On 24 Jan 2005, at 02:48, Richard HOLLAND wrote:
> > > 
> > > > It's a compressed binary format. I doubt BioJava would be able to
> > > > read it without a lot of effort as the current parser framework
> > > > is set up for text input only.
> > > 
> > > Nib support probably wouldn't fit into the text-oriented parsing
> > > framework, but I'm sure it could be supported somehow if there was
> > > demand.  A quick google doesn't turn up any format documentation,
> > > but Jim Kent's IO code is at:
> > > 
> > >            http://www.soe.ucsc.edu/~kent/src/unzipped/lib/nib.c
> > > 
> > > One interesting way to handle this might be to open the nib file as
> > > a MappedByteBuffer, and back a SymbolList directly using that --
> > > potentially giving us an efficient way of working with huge
> > > sequences.  Any interest in that?
> > > 
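For what the MappedByteBuffer idea might look like, here is a minimal
sketch. Note the assumptions: the file layout (4 bits per base, first
base in the high nibble, no header) and the code table are made up for
illustration and are NOT taken from the nib format, which would have to
be read off nib.c. MappedNibbles is an invented name, and the sequence
length is passed in by the caller.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch only: hypothetical class and hypothetical file layout, not the
// real nib format and not BioJava API.
final class MappedNibbles {
    private static final String CODES = "ACGTN";   // assumed code table
    private final MappedByteBuffer buf;
    private final int length;                      // number of bases

    private MappedNibbles(MappedByteBuffer buf, int length) {
        this.buf = buf;
        this.length = length;
    }

    /** Maps the whole file read-only; the OS pages data in on demand. */
    static MappedNibbles open(Path file, int length) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapping remains valid after the channel is closed.
            return new MappedNibbles(
                    ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size()), length);
        }
    }

    /** Random access to base i without reading anything else explicitly. */
    char baseAt(int i) {
        if (i < 0 || i >= length) {
            throw new IndexOutOfBoundsException("index " + i);
        }
        int b = buf.get(i >> 1);   // absolute get, no shared position state
        int code = (i & 1) == 0 ? (b >> 4) & 0xF : b & 0xF;
        return code < CODES.length() ? CODES.charAt(code) : '?';
    }
}
```

A SymbolList backed this way would decode to Symbol objects on demand
instead of returning chars; that part is left out here.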
> > >            Thomas.
> > > 
> > > 
> > > 
> > > 
> > > 
> > 
> > 
> > 
> > 
> 
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at biojava.org 
> http://biojava.org/mailman/listinfo/biojava-l
>