[Biopython-dev] Storing Bio.SeqIO.index() offsets in SQLite

Tue Jun 8 16:35:07 UTC 2010

On Tue, Jun 8, 2010 at 4:47 PM, Brent Pedersen <bpederse at gmail.com> wrote:
>
> my results may not be typical either, but using an earlier version of
> peter's sqlite biopython branch and comparing to screed
> (http://github.com/acr/screed), and my file-index
> (http://github.com/brentp/bio-playground/tree/master/fileindex/ ) i
> found that biopython's implementation is at most, a bit more than 2x
> slower. and it does the fastq parsing much more rigorously.
>
> also, i didn't see much difference between berkeleydb and
> tokyocabinet--though the ctypes-based TC wrapper i was using has since
> been streamlined.
> here's what i saw for 15+ million records with this script:
> http://github.com/brentp/bio-playground/blob/master/fileindex/examples/bench.py
>
> /opt/src/methylcode/data/s_1_sequence.txt
> benchmarking fastq file with 15646356 records (62585424 lines)
> performing 500000 random queries
>
> screed
> ------
> create: 704.764
> search: 51.717
>
> biopython-sqlite
> ----------------
> create: 727.868
> search: 92.947
>
> fileindex
> ---------
> create: 294.356
> search: 53.701

Are you using a recent version of screed (with SQLite internally)?

Which back end are your "fileindex" numbers for? BDB?

I'd say that the slow "search" from (the old branch of) Biopython is
down to our FASTQ parsing time, which includes lots of object
creation. The get_raw method, which returns a record's raw text
without parsing it, can be useful here depending on what you want
to achieve:
http://news.open-bio.org/news/2010/04/partial-seq-files-biopython/
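
To illustrate why raw access avoids the parsing cost, here is a rough
stdlib-only sketch of the idea, not Biopython's actual code. It assumes
strict four-line FASTQ records (real files can wrap lines), and the
names build_offset_index and read_raw_record are made up for this
example:

```python
import io


def build_offset_index(handle):
    """Map each record id to (offset, length) in a binary FASTQ handle.

    Assumption: every record is exactly four lines (no line wrapping).
    """
    index = {}
    while True:
        start = handle.tell()
        title = handle.readline()
        if not title:
            break  # end of file
        if not title.startswith(b"@"):
            raise ValueError("expected '@' at offset %d" % start)
        # Skip the sequence, '+' and quality lines without parsing them.
        for _ in range(3):
            handle.readline()
        key = title[1:].split(None, 1)[0].decode("ascii")
        index[key] = (start, handle.tell() - start)
    return index


def read_raw_record(handle, index, key):
    """Return the record's bytes with one seek and one read, no parsing."""
    offset, length = index[key]
    handle.seek(offset)
    return handle.read(length)


# Tiny in-memory stand-in for a FASTQ file.
data = b"@r1 first\nACGT\n+\nIIII\n@r2\nGGCC\n+\n!!!!\n"
handle = io.BytesIO(data)
index = build_offset_index(handle)
raw = read_raw_record(handle, index, "r2")
```

A lookup is then a seek and a read, with no record objects built,
which is roughly the saving get_raw offers when you only need to copy
records to another file.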

The version you tried didn't do anything clever with the SQLite
indexes, batched inserts etc. I'm hoping the current code will be
faster (although there is likely a penalty from having two switchable
back ends). Brent, could you re-run this benchmark with this code:
http://github.com/peterjc/biopython/tree/index-sqlite-batched
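
For anyone wondering what batched inserts look like in practice, here
is a minimal sketch using only the standard sqlite3 module. It is not
the code in the branch; the table layout, the store_offsets and lookup
names, and the batch size are all assumptions for illustration:

```python
import sqlite3
from itertools import islice


def store_offsets(con, records, batch_size=10000):
    """Insert (key, offset, length) rows, one transaction per batch."""
    con.execute(
        "CREATE TABLE IF NOT EXISTS offsets ("
        "key TEXT PRIMARY KEY, offset INTEGER, length INTEGER)"
    )
    it = iter(records)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            break
        # 'with con' commits once per batch rather than once per row.
        with con:
            con.executemany("INSERT INTO offsets VALUES (?, ?, ?)", batch)


def lookup(con, key):
    """Return (offset, length) for a key via the primary key index."""
    row = con.execute(
        "SELECT offset, length FROM offsets WHERE key = ?", (key,)
    ).fetchone()
    if row is None:
        raise KeyError(key)
    return row


con = sqlite3.connect(":memory:")
# Hypothetical offsets for 25 fixed-size records of 16 bytes each.
rows = (("read%d" % i, i * 16, 16) for i in range(25))
store_offsets(con, rows, batch_size=10)
hit = lookup(con, "read7")
total = con.execute("SELECT COUNT(*) FROM offsets").fetchone()[0]
```

Grouping many rows per commit is usually what matters most for index
creation time with SQLite, since each commit is comparatively
expensive.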

The function was renamed before landing on the trunk, so you'll need
to change the Biopython call in your test script from this:

fi = SeqIO.indexed_dict(f, idx, "fastq")

to this:

fi = SeqIO.index(f, idx, "fastq", db=True)

or give an explicit filename:

fi = SeqIO.index(f, idx, "fastq", db="/tmp/filename.idx")

where db is the new parameter controlling whether, and where,
the lookup table is stored on disk.

Peter