[Biopython-dev] Bio.SeqIO.index extension, Bio.SeqIO.index_many

Mon Dec 20 19:09:04 UTC 2010

On Tue, Nov 30, 2010 at 11:24 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>
> One thing I haven't done yet (any volunteers?) is any
> benchmarking - for example comparing the index
> build and retrieval times for some large files using
> Biopython 1.55 (recent baseline), Biopython 1.56
> (should be faster on retrieval) and the branch to
> check for any regressions in Bio.SeqIO.index(), and
> compare this to Bio.SeqIO.index_many() which being
> disk based will be slower but require much less RAM.
>

Testing here is complicated because each file format
can behave differently.

I've noticed a slight regression for GenBank indexing,
particularly when building the index, where I now also
track the end of each record (although the Bio.SeqIO.index
code does not use this). This can probably be improved on.

For example, using the current trunk code and
Bio.SeqIO.index() on the 240MB GenBank file gbvrt1.seq
(31065 records), we have:

Indexed in 5.2s
All with get_raw took 5.53s
All as SeqRecord objects took 24.08s

Using the branch, and Bio.SeqIO.index()

gbvrt1.seq contains 31065 records
Indexed in 7.1s
All with get_raw took 6.08s
All as SeqRecord objects took 24.60s

Using the branch, and Bio.SeqIO.index_db()

Indexed in 7.2s
All with get_raw took 1.75s
All as SeqRecord objects took 25.15s
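
For anyone wanting to repeat this kind of measurement, here is a
minimal sketch of a timing harness. It is not the script used for
the numbers in this post; the helper names time_index() and report()
are made up for illustration. It assumes the index object is
iterable over its keys and offers get_raw(), as the objects returned
by Bio.SeqIO.index() and Bio.SeqIO.index_db() do.

```python
import time


def time_index(build_index):
    """Time building an index, then fetching every record two ways.

    build_index is any zero-argument callable returning a dict-like
    object which also has a get_raw(key) method.
    Returns (record count, build time, get_raw time, record time).
    """
    start = time.time()
    index = build_index()
    build_time = time.time() - start

    # Collect the keys outside the timed retrieval loops.
    keys = list(index)

    start = time.time()
    for key in keys:
        index.get_raw(key)  # raw text of the record, no parsing
    raw_time = time.time() - start

    start = time.time()
    for key in keys:
        index[key]  # for a Bio.SeqIO index this parses a SeqRecord
    record_time = time.time() - start

    return len(keys), build_time, raw_time, record_time


def report(filename, fmt, index_filename=None):
    """Print timings for Bio.SeqIO.index(), or index_db() if given an index file."""
    # Imported here so time_index() can be tried without Biopython.
    from Bio import SeqIO

    if index_filename is None:
        build = lambda: SeqIO.index(filename, fmt)
    else:
        build = lambda: SeqIO.index_db(index_filename, [filename], fmt)
    count, build_time, raw_time, record_time = time_index(build)
    print("%s contains %i records" % (filename, count))
    print("Indexed in %0.1fs" % build_time)
    print("All with get_raw took %0.2fs" % raw_time)
    print("All as SeqRecord objects took %0.2fs" % record_time)
```

Usage would be something like report("gbvrt1.seq", "genbank") for
the in-memory index, or report("gbvrt1.seq", "genbank", "gbvrt1.idx")
for the SQLite one (the index filename here is arbitrary). Note that
if the index file already exists, index_db() reloads it rather than
rebuilding it, so delete it first when timing the build step.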

I haven't looked at EMBL, SwissProt or UniProt XML files
yet - but I expect their behaviour to be similar.

The major use case for indexing large files is probably
FASTA and FASTQ. Testing on FASTQ files with 7 million
or so entries shows very little change - which is good :)
I really should have made a note of the timings, but I
don't have time right now to repeat them, maybe tomorrow.

Here are timings from a smaller file, containing 1253960
records from a Roche 454 run in FASTQ format.

Using the trunk and Bio.SeqIO.index()

Indexed in 20.1s
All with get_raw took 34.70s
All as SeqRecord objects took 234.68s

Using the branch and Bio.SeqIO.index()

Indexed in 20.8s
All with get_raw took 35.86s
All as SeqRecord objects took 238.28s

Using the branch and Bio.SeqIO.index_db()

Indexed in 41.9s
All with get_raw took 41.20s
All as SeqRecord objects took 271.26s

This example shows Bio.SeqIO.index() remains about the
same speed as before for FASTQ files.

The other general message is that for large files (many
records), using the SQLite back end does slow down the
index building step (41.9s versus 20.8s in the FASTQ
example), but access to the records remains competitive
with the in-memory Python dict (41.20s versus 35.86s for
get_raw). And of course you can scale to index files bigger
than you could otherwise.
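
To illustrate the trade-off, here is a toy sketch which records the
file offset of each record in a FASTA file, either in a Python dict
or in an SQLite table. This is only an illustration using the
standard library: the table layout and the function names are
invented for this example and are not the schema or code used on
the branch. It assumes every header line has an identifier directly
after the ">" character.

```python
import sqlite3


def fasta_offsets(filename):
    """Yield (record id, byte offset) for each '>' header line."""
    with open(filename, "rb") as handle:
        offset = 0
        for line in handle:
            if line.startswith(b">"):
                # The id is the first word after the '>' character.
                yield line[1:].split(None, 1)[0].decode(), offset
            offset += len(line)


def build_dict_index(filename):
    """In-memory index: quick to build, but every key lives in RAM."""
    return dict(fasta_offsets(filename))


def build_sqlite_index(filename, index_filename):
    """On-disk index: extra work to build, but only queried rows are loaded."""
    con = sqlite3.connect(index_filename)
    con.execute("CREATE TABLE offsets (key TEXT PRIMARY KEY, offset INTEGER)")
    con.executemany("INSERT INTO offsets VALUES (?, ?)", fasta_offsets(filename))
    con.commit()
    return con


def lookup(con, key):
    """Return the stored offset for key, raising KeyError like a dict would."""
    row = con.execute(
        "SELECT offset FROM offsets WHERE key = ?", (key,)
    ).fetchone()
    if row is None:
        raise KeyError(key)
    return row[0]
```

With either index, fetching a record is then a seek to the stored
offset followed by a read. The dict answers each lookup from memory,
while the SQLite version answers it with a query against the index
file, which is why it can cope with more records than would fit in
RAM.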

Peter