[Open-bio-l] OBDA redux? Compressed files
Peter Cock
p.j.a.cock at googlemail.com
Mon Nov 14 23:01:03 UTC 2011
On Sun, Nov 13, 2011 at 12:30 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>
> I've recently been experimenting with using compressed
> files - in particular simple GZIP files (ignoring any block structure)
> and BGZF (the specialised gzipped blocking used in BAM), see:
>
> http://blastedbio.blogspot.com/2011/11/bgzf-blocked-bigger-better-gzip.html
> http://seqanswers.com/forums/showthread.php?t=15347
>
> The virtual offset approach used in BGZF squeezes a 16 bit
> within block offset (thus limiting you to 64kb blocks) and a
> 48 bit block start offset (thus limiting you to a 256TB file) into
> a single 64bit "virtual" offset. That makes sense if you are
> keeping the lookup table or many offsets in memory, and
> can be used as is with code expecting a single offset (like
> the current Biopython SQLite index schema).
>
> Also bzip2 ... is block based, with the block size ranging
> from 100KB to 900KB.
>
> http://bzip.org/
> http://bzip.org/1.0.5/bzip2-manual-1.0.5.html
>
A point of clarification, having found the Wikipedia page
http://en.wikipedia.org/wiki/Bzip2 to be very informative:
those block sizes (100kb to 900kb) are measured after
bzip2's initial run-length encoding step, not on the original
data, so after decompression a 900kb block can in some
cases reach about 46MB.
Clearly that means the BGZF virtual offset approach
cannot be applied to arbitrary bzip2 files (much like it
can't be applied to arbitrary gzip files) without imposing
some a priori limit on the decompressed size of each block.
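
To make the limit concrete, here is a minimal Python sketch of the
packing described above (16 bit within block offset, 48 bit block
start). The function names are illustrative assumptions for this
message rather than a reference to any particular library's code.

```python
def make_virtual_offset(block_start, within_block):
    """Pack a block start and a within-block offset into one 64 bit integer."""
    # The within-block offset only has 16 bits, so the decompressed
    # block must be smaller than 64kb.
    if not 0 <= within_block < 2 ** 16:
        raise ValueError("within block offset needs more than 16 bits")
    # The block start only has 48 bits, so the file is limited to 256TB.
    if not 0 <= block_start < 2 ** 48:
        raise ValueError("block start offset needs more than 48 bits")
    return (block_start << 16) | within_block


def split_virtual_offset(virtual_offset):
    """Recover (block_start, within_block) from a packed virtual offset."""
    return virtual_offset >> 16, virtual_offset & 0xFFFF
```

An offset tens of megabytes into a decompressed bzip2 block fails the
first check, which is exactly the problem: the scheme only works if
the decompressed block size is capped in advance.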
> On the other hand, storing the block start and within block
> separately is truly generic and could be used on any blocked
> GZIP file (including BGZF) and BZIP2 etc. It would make
> the SQLite schema a bit more complicated though.
So on reflection, if we want to index any blocked compressed
file format, such as GZIP (including BGZF) and BZIP2, then
two offsets do seem to be required (the block offset, and
the data offset within the block after decompression).
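
As a rough sketch of what storing the two offsets separately could
look like in SQLite, using Python's sqlite3 module. The table and
column names here are hypothetical, chosen for illustration only;
this is not the current Biopython index schema.

```python
import sqlite3


def build_index(records):
    """Create an in-memory index from (key, block_start, within_block) tuples."""
    con = sqlite3.connect(":memory:")
    # Two integer columns instead of one packed virtual offset, so
    # neither value is limited to 16 or 48 bits.
    con.execute(
        "CREATE TABLE offset_data ("
        "key TEXT PRIMARY KEY, "
        "block_start INTEGER, "
        "within_block INTEGER)"
    )
    con.executemany("INSERT INTO offset_data VALUES (?, ?, ?)", records)
    con.commit()
    return con


def lookup(con, key):
    """Return (block_start, within_block) for a record key."""
    row = con.execute(
        "SELECT block_start, within_block FROM offset_data WHERE key = ?",
        (key,),
    ).fetchone()
    if row is None:
        raise KeyError(key)
    return row
```

To fetch a record you would seek to block_start in the compressed
file, decompress that block, and then skip within_block bytes of the
decompressed data.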
Peter