[Biopython-dev] BGZF support, was Re: Biopython 1.60 plans and beyond

Tue Apr 24 20:07:03 UTC 2012

On Tue, Apr 24, 2012 at 7:24 PM, Irwin Jungreis <ILJungr at csail.mit.edu> wrote:
> Hello Andrew and Peter.
>

Hi again Irwin,

> The size penalty of bgz versus gzip for .maf files is quite small. For
> example, compressing the 6-way C. elegans alignment .maf files is 108.9 MB
> with gzip and 112 MB with bgz, a difference of less than 3%. (Each is
> smaller than the uncompressed file by a factor of about 4 or 5.)

That's good, and given the nature of the MAF format it is in
line with what I was hoping for. See also the overheads I
measured for FASTA, SwissProt and UniProt XML here:
http://blastedbio.blogspot.co.uk/2011/11/bgzf-blocked-bigger-better-gzip.html

> I am not very familiar with biopython, so I've been using my own utilities.
> To work with alignments I create an index file consisting of a 32-byte
> record for each maf block. Each record contains the block start on the
> reference species chromosome, the block length on the reference species, and
> the virtual offset of the block start in the .maf file. I then have a
> utility that will extract the alignment for a given set of spliced regions,
> e.g., chrX:11568015-11569059+chrX:11569364-11569395 on the '-' strand, and
> output it as a list of pairs (assembly name, base string).
>
> I'd be happy to share, but I have no idea how this would fit into the
> existing biopython infrastructure.
>
> Best,
> Irwin
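
For anyone following along on the list, an index of this kind could
be sketched roughly as below. This is only an illustration: the field
order, the integer widths and the 8 bytes of padding are my guesses
at how three values might fill a 32-byte record, not Irwin's actual
layout, and the function names are hypothetical.

```python
import bisect
import struct

# ASSUMED layout (not Irwin's actual format): little-endian
# int64 start, int64 length, uint64 virtual offset, 8 pad bytes.
RECORD = struct.Struct("<qqQ8x")
assert RECORD.size == 32


def pack_index(blocks):
    """Pack (start, length, virtual_offset) tuples, sorted by start."""
    return b"".join(RECORD.pack(*block) for block in sorted(blocks))


def unpack_index(data):
    """Unpack an index back into a list of (start, length, voffset)."""
    return [RECORD.unpack_from(data, pos)
            for pos in range(0, len(data), RECORD.size)]


def overlapping(records, start, end):
    """Return records overlapping the half-open interval [start, end).

    Assumes records are sorted by start. A record starting at or
    after `end` cannot overlap, so bisect trims those off; the
    remaining candidates are filtered with a linear scan.
    """
    starts = [rec[0] for rec in records]
    stop = bisect.bisect_left(starts, end)
    return [rec for rec in records[:stop] if rec[0] + rec[1] > start]
```

Each virtual offset returned by overlapping() would then be passed
to a BGZF reader's seek to jump straight to that MAF block.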

Ah - I must have misinterpreted your earlier email (off list). I'd
assumed you were using Andrew's Biopython branch which
indexes MAF files using an SQLite database of offsets. But
in practice the principle is the same - BGZF lets you have
good compression of MAF files and random access. Thank
you for clarifying this.
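
To spell out why this works: a BGZF virtual offset is a single
64-bit number, where the upper 48 bits give the start of a BGZF
block in the compressed file and the lower 16 bits give the
position within that block once decompressed. A minimal sketch of
the arithmetic in plain Python follows. The function names here
are just for illustration (I believe Bio.bgzf offers equivalents,
but this sketch does not depend on it).

```python
def make_virtual_offset(block_start, within_block):
    """Combine a compressed block start and an in-block offset.

    block_start is the byte offset of the BGZF block in the
    compressed file (must fit in 48 bits); within_block is the
    offset into the decompressed block (must fit in 16 bits).
    """
    if not 0 <= within_block < 2 ** 16:
        raise ValueError("within-block offset must fit in 16 bits")
    if not 0 <= block_start < 2 ** 48:
        raise ValueError("block start offset must fit in 48 bits")
    return (block_start << 16) | within_block


def split_virtual_offset(virtual_offset):
    """Split a virtual offset into (block_start, within_block)."""
    return virtual_offset >> 16, virtual_offset & 0xFFFF
```

So an index only needs to store one integer per MAF block, and a
reader can seek to the right compressed block, decompress just
that block, and skip ahead within it.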

If you use Python at all, perhaps you'd have some feedback
on Andrew's indexing plans? That would be very welcome. Andrew
has done a great job of explaining the proposed code usage here:
http://biopython.org/wiki/Multiple_Alignment_Format

Regards,

Peter