[Biopython-dev] Storing Bio.SeqIO.index() offsets in SQLite

Brent Pedersen bpederse at gmail.com
Wed Jun 9 04:33:12 UTC 2010


On Tue, Jun 8, 2010 at 9:35 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Tue, Jun 8, 2010 at 4:47 PM, Brent Pedersen <bpederse at gmail.com> wrote:
>>
>> my results may not be typical either, but using an earlier version of
>> peter's sqlite biopython branch and comparing it to screed
>> (http://github.com/acr/screed) and my file-index
>> (http://github.com/brentp/bio-playground/tree/master/fileindex/), i
>> found that biopython's implementation is at most a bit more than 2x
>> slower, and it does the fastq parsing much more rigorously.
>>
>> also, i didn't see much difference between berkeleydb and
>> tokyocabinet, though the ctypes-based TC wrapper i was using has
>> since been streamlined.
>> here's what i saw for 15+ million records with this script:
>> http://github.com/brentp/bio-playground/blob/master/fileindex/examples/bench.py
>>
>> /opt/src/methylcode/data/s_1_sequence.txt
>> benchmarking fastq file with 15646356 records (62585424 lines)
>> performing 500000 random queries
>>
>> screed
>> ------
>> create: 704.764
>> search: 51.717
>>
>> biopython-sqlite
>> ----------------
>> create: 727.868
>> search: 92.947
>>
>> fileindex
>> ---------
>> create: 294.356
>> search: 53.701
>
> Are you using a recent version of screed (with SQLite internally)?
>
> Which back end are your "fileindex" numbers for? BDB?
>
> I'd say that the slow "search" from (the old branch of) Biopython is
> down to our FASTQ parsing time, which includes lots of object
> creation. The get_raw method can be useful here depending on
> what you want to achieve:
> http://news.open-bio.org/news/2010/04/partial-seq-files-biopython/
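>
> For example, something like this sketch (the filename and record id
> are just placeholders) fetches the raw FASTQ entry as a string,
> skipping the SeqRecord object creation:
>
> from Bio import SeqIO
> index = SeqIO.index("reads.fastq", "fastq")
> raw = index.get_raw("some_read_id")  # raw record text, no parsing
> record = index["some_read_id"]       # full parse into a SeqRecord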
>
> The version you tried didn't do anything clever with the SQLite
> indexes, batched inserts etc. I'm hoping the current code will be
> faster (although there is likely a penalty from having two switchable
> back ends). Brent, could you re-run this benchmark with this code:
> http://github.com/peterjc/biopython/tree/index-sqlite-batched
>
> You'll need to change the Biopython call in your test script from
> this (it was renamed before landing on the trunk):
>
> fi = SeqIO.indexed_dict(f, idx, "fastq")
>
> to this:
>
> fi = SeqIO.index(f, idx, "fastq", db=True)
>
> or give an explicit filename:
>
> fi = SeqIO.index(f, idx, "fastq", db="/tmp/filename.idx")
>
> where db is the new parameter for controlling where and if
> the lookup table is stored on disk.
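>
> Either way, the result behaves like a read-only dictionary, so a
> lookup is just (a sketch, with a placeholder record id):
>
> record = fi["some_read_id"]  # seeks to the offset, parses one record
> print(record.id, len(record))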
>
> Peter
>

done. the previous times and the current ones were using py-tcdb, not
bsddb. the author of tcdb made some improvements, so it's faster this
time, and your SeqIO implementation is almost 2x as fast to load as
the previous one. that's a nice implementation. i didn't try get_raw.

these timings are with your latest version and the version of screed
pulled from http://github.com/acr/screed master today.

/opt/src/methylcode/data/s_1_sequence.txt
benchmarking fastq file with 15646356 records (62585424 lines)
performing 500000 random queries

screed
------
create: 699.210
search: 51.043

biopython-sqlite
----------------
create: 386.647
search: 93.391

fileindex
---------
create: 184.088
search: 48.887
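
for reference, a simplified sketch of what the benchmark above does
(not the actual bench.py; the function and variable names are
illustrative):

import random
import time
from Bio import SeqIO

def bench(f, idx, n_queries=500000):
    # time index creation
    t0 = time.time()
    fi = SeqIO.index(f, idx, "fastq", db=True)
    print("create: %.3f" % (time.time() - t0))
    # time random-access lookups; each one seeks into the file
    # and parses a single record
    keys = random.sample(list(fi.keys()), n_queries)
    t0 = time.time()
    for key in keys:
        fi[key]
    print("search: %.3f" % (time.time() - t0))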


