[Biopython] random access to bgz file

Mon Apr 14 17:29:35 UTC 2014

On 2014-04-09 22:00, Peter Cock wrote: 

> On Wed, Apr 9, 2014 at 6:35 PM, tc9 <tc9 at sanger.ac.uk> wrote:
> 
>> Peter, thanks for link to html version of the bgzf documentation. Here some additional details. I am trying to do random access on a bgzipped haplotype/HAPS file. Here file format description: https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html#hapsample [1] I compressed the haps file with bgzip: zcat file.haps.gz | bgzip > file.haps.bgz I know the byte position of each newline after decompression, but I need the block offsets to go from a decompressed position to a virtual offset.
> 
> Not necessarily - all you need is the virtual offset which
> handle.tell() would give you. How did you get the positions
> in the decompressed file? Can you not repeat that indexing
> but using the virtual offsets via the BGZF handle? The
> big advantage is you just use the virtual offsets without
> having to know how they are calculated.
> 
> If you really want to map from decompressed offsets to
> virtual offsets, you will need both the raw start offset of
> each block, but also the decompressed size of each
> block (often 64kb, but it can be less).

Initially I got the byte positions in the decompressed stream by reading
the entire thing once with gzip.open(). I re-read the compressed file
with BgzfReader and got the virtual offset of line number 1 million and
was able to seek that line with BgzfReader much faster than I could have
done with gzip.open(). See solution below, which I will post to a
question of mine on stackoverflow.com. 

from Bio import bgzf 

file='file.haps.gz' 

handle = bgzf.BgzfReader(file)
for i in range(10**6):
 handle.readline()
virtual_offset = handle.tell()
line1 = handle.readline()
handle.close() 

handle = bgzf.BgzfReader(file)
handle.seek(virtual_offset)
line2 = handle.readline()
handle.close() 

assert line1==line2 

For completeness I want to mention that one can do: 

block_start_offset, within_block_offset =
bgzf.split_virtual_offset(virtual_offset) 

virtual_offset = bgzf.make_virtual_offset(block_start_offset,
within_block_offset) 

P.S. I was without a stable internet connection for a few days. Hence
the slow reply. Thanks for the help! 

Links:
------
[1]
https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html#hapsample

-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE.