[Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk?

Peter biopython at maubp.freeserve.co.uk
Fri Jun 4 12:59:22 UTC 2010


On Fri, Jun 4, 2010 at 11:53 AM, Kevin <aboulia at gmail.com> wrote:
> I vote for sqlite index. Have been using bsddb to do the same but the db
> is inflated compared to plain text. Performance is not bad using btree.

The other major point against bsddb is that future versions of Python
will not include it in the standard library - but Python 2.5+ does have
sqlite3 included.

> For gzip I feel it might be possible to gunzip into a stream which
> biopython can parse on the fly?

Yes of course, like this:

import gzip
from Bio import SeqIO

# Open in text mode ("rt") so the parser receives str lines on Python 3
with gzip.open("uniprot_sprot.dat.gz", "rt") as handle:
    for record in SeqIO.parse(handle, "swiss"):
        print(record.id)

Parsing is easy - the point of this discussion is random access to
any record within the stream (which requires jumping to an offset).
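
To make the offset idea concrete, here is a rough sketch of a sqlite3-backed offset index for a plain (uncompressed) FASTA file. This is only an illustration, not how Bio.SeqIO.index() is implemented: the function names build_offset_index and get_raw_record and the table layout are made up for this example, and it assumes record IDs are unique and that every ">" line carries an ID.

```python
import sqlite3


def build_offset_index(filename, db_path=":memory:"):
    """Store the byte offset of each '>' header line in an SQLite table.

    Hypothetical helper for illustration; assumes unique, non-empty IDs.
    """
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS offsets (id TEXT PRIMARY KEY, offset INTEGER)"
    )
    # Binary mode so tell() returns true byte offsets
    with open(filename, "rb") as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                # The ID is the first whitespace-delimited word after '>'
                record_id = line[1:].split(None, 1)[0].decode()
                con.execute(
                    "INSERT INTO offsets VALUES (?, ?)", (record_id, offset)
                )
    con.commit()
    return con


def get_raw_record(filename, con, record_id):
    """Seek straight to a record and return its raw text."""
    row = con.execute(
        "SELECT offset FROM offsets WHERE id = ?", (record_id,)
    ).fetchone()
    if row is None:
        raise KeyError(record_id)
    lines = []
    with open(filename, "rb") as handle:
        handle.seek(row[0])
        lines.append(handle.readline())  # the header line itself
        for line in handle:
            if line.startswith(b">"):  # next record starts here
                break
            lines.append(line)
    return b"".join(lines).decode()
```

The seek() call is what makes this cheap on a plain file. On a gzip handle, seeking to an arbitrary offset means decompressing the data before it, which is why gzip support is the awkward part.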

Peter


