[Biopython-dev] Storing Bio.SeqIO.index() offsets in SQLite

Brent Pedersen bpederse at gmail.com
Wed Jun 9 14:42:29 UTC 2010


On Wed, Jun 9, 2010 at 1:55 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Wed, Jun 9, 2010 at 5:33 AM, Brent Pedersen <bpederse at gmail.com> wrote:
>>>
>>> The version you tried didn't do anything clever with the SQLite
>>> indexes, batched inserts etc. I'm hoping the current code will be
>>> faster (although there is likely a penalty from having two switchable
>>> back ends). Brent, could you re-run this benchmark with this code:
>>> http://github.com/peterjc/biopython/tree/index-sqlite-batched
>>> ...
>>
>> done.
>
> Thank you Brent :)
>
>> the previous timings and the current ones used py-tcdb, not bsddb.
>> the author of tcdb made some improvements, so it's faster this time,
>
> OK, so you are using Tokyo Cabinet to store the lookup table here
> rather than BDB. Link, http://code.google.com/p/py-tcdb/
>
>> and your SeqIO implementation is almost 2x as fast to load as the
>> previous one. that's a nice implementation. i didn't try get_raw.
>
> I've got some more re-factoring in mind which should help a little
> more (but mainly to make the structure clearer).
>
>> these timings are with your latest version, and the version of
>> screed pulled from http://github.com/acr/screed master today.
>
> Having had a quick look, they are using SQLite3 in much the
> same way as I was initially. They create the index before loading
> (rather than after loading) and they use a single insert per
> offset (rather than batching the inserts in a transaction or using
> the executemany method). I'm pretty sure from my experiments that
> those changes would speed up screed's loading time a lot
> (probably in line with the speed-up I achieved).
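>
> Roughly, the batched approach looks like this (an untested sketch
> assuming a bare name/offset table, not the actual Bio.SeqIO code):
>
>     import sqlite3
>
>     con = sqlite3.connect("example.idx")
>     con.execute("CREATE TABLE offset_data (key TEXT, offset INTEGER)")
>     # Batch the inserts in one transaction via executemany, rather
>     # than issuing (and committing) one INSERT per record:
>     offsets = [("record%i" % i, i * 1000) for i in range(100000)]
>     con.executemany("INSERT INTO offset_data VALUES (?, ?)", offsets)
>     con.commit()
>     # Build the lookup index only after loading, so SQLite does not
>     # have to maintain it during every insert:
>     con.execute("CREATE INDEX key_index ON offset_data(key)")
>     con.commit()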
>
>> /opt/src/methylcode/data/s_1_sequence.txt
>> benchmarking fastq file with 15646356 records (62585424 lines)
>> performing 500000 random queries
>>
>> screed
>> ------
>> create: 699.210
>> search: 51.043
>>
>> biopython-sqlite
>> ----------------
>> create: 386.647
>> search: 93.391
>>
>> fileindex
>> ---------
>> create: 184.088
>> search: 48.887
>
> That's got us looking more competitive. As noted above, I think
> screed's loading time could be much reduced by tweaking how
> they use SQLite3. I wonder what the breakdown for fileindex is
> between calling Tokyo Cabinet and the fileindex code itself?
> I guess we should try Tokyo Cabinet as the back end in
> Bio.SeqIO.index() for comparison.
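>
> (Whatever the back end, all it needs to store is a mapping of
> record names to file offsets; retrieval is then just a seek plus
> a parse. An illustrative sketch with made-up names, not the real
> internals:)
>
>     from Bio import SeqIO
>
>     def fetch(handle, offsets, name, fmt="fastq"):
>         # Jump to the record's byte offset, then let the normal
>         # parser read a single record from that position.
>         handle.seek(offsets[name])
>         return next(SeqIO.parse(handle, fmt))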
>
> Peter
>
> P.S. Could you measure the database file sizes on disk?
>

file sizes for the raw reads, then the screed, fileindex (tcdb),
and biopython indexes respectively:
-rw-r--r-T 1 brentp users  3.3G 2009-11-17 13:32
/opt/src/methylcode/data/s_1_sequence.txt
-rw-r--r-- 1 brentp brentp 3.8G 2010-06-08 16:09
/opt/src/methylcode/data/s_1_sequence.txt_screed
-rw-r--r-- 1 brentp brentp 1.2G 2010-06-08 16:21
/opt/src/methylcode/data/s_1_sequence.txt.fidx
-rw-r--r-- 1 brentp brentp 1.5G 2010-06-08 21:15
/opt/src/methylcode/data/s_1_sequence.txt.bidx

that's not using any compression for the fileindex.
i think the overhead of the fileindex code + tcdb code is pretty low
now. i think the only further improvement would come from a cython or
c version of a TC wrapper--and even then, not much.
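
one way to check the breakdown peter asked about is to time the same
random queries against the raw store and through the wrapper (a
sketch; the two lookup callables are placeholders, not real api
names):

    import time

    def queries_per_second(lookup, keys):
        # time lookup(key) over all the keys and report the rate
        start = time.time()
        for key in keys:
            lookup(key)
        return len(keys) / (time.time() - start)

    # e.g. compare queries_per_second(raw_tc_get, sample_keys)
    # against queries_per_second(wrapped_get, sample_keys)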

-brentp


