[Biopython-dev] Storing Bio.SeqIO.index() offsets in SQLite

Tue Jun 8 15:47:18 UTC 2010

On Tue, Jun 8, 2010 at 4:00 AM, Kevin Jacobs <jacobs at bioinformed.com>
<bioinformed at gmail.com> wrote:
> On Tue, Jun 8, 2010 at 5:35 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>
>> On Mon, Jun 7, 2010 at 10:10 PM, Kevin Jacobs wrote:
>> > On Mon, Jun 7, 2010 at 2:23 PM, Peter wrote:
>> >>
>> >> Having now tried using this on some files with tens of millions of
>> >> records, tuning how we use SQLite is going to be important.
>> >>
>> > Wouldn't a Berkeley database be much much faster for constructing
>> > simple key to offset mappings?
>>
>> Maybe - now that I've done the refactoring on Bio.SeqIO.index() to
>> allow two back ends (python dict or SQLite) trying a third (BDB) is
>> much easier. Did you know BDB was used in the old OBDA index
>> files? However, Python 2.6 deprecated bsddb (the Python Interface
>> to Berkeley DB library) and Python is pushing people to SQLite3
>> instead.
>>
>>
> Hi Peter,
>
> I am aware that SQLite is taking over the job of serving as the default
> embedded database for Python and am in vigorous agreement with that trend.
> I use SQLite for a wide range of tasks and am extremely happy with it for
> most applications.  Unfortunately, for pure key-value mapping tasks, I've
> found SQLite to be 4-10x slower than a well-tuned BDB tree, even with
> batched updates and using the most aggressive SQLite performance pragmas. My
> results may not be typical, but I thought I'd raise the issue given the
> magnitude of the performance difference.
>
> Best regards,
> -Kevin

My results may not be typical either, but I compared an earlier version of
Peter's SQLite Biopython branch against screed
(http://github.com/acr/screed) and my fileindex
(http://github.com/brentp/bio-playground/tree/master/fileindex/). I
found that Biopython's implementation is, at most, a bit more than 2x
slower, and it does the FASTQ parsing much more rigorously.
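
For readers who have not seen the approach, the kind of key-to-offset
indexing being compared in this thread can be sketched roughly as below.
This is not the code of Biopython, screed, or fileindex: the function
names, the table layout, the batch size, and the pragma choices are all
assumptions made for illustration, and the sketch only handles unwrapped
FASTQ records of exactly four lines each.

```python
import sqlite3


def build_offset_index(fastq_path, db_path, batch_size=100000):
    """Store record-name -> byte-offset pairs for a 4-line-per-record FASTQ file.

    Hypothetical helper for illustration; not part of any library.
    """
    con = sqlite3.connect(db_path)
    # Trade durability for speed; tolerable only because the index can be rebuilt.
    con.execute("PRAGMA synchronous = OFF")
    con.execute("PRAGMA journal_mode = MEMORY")
    con.execute("CREATE TABLE offsets (name TEXT PRIMARY KEY, pos INTEGER)")
    batch = []
    with open(fastq_path, "rb") as handle:
        while True:
            pos = handle.tell()
            header = handle.readline()
            if not header:
                break  # end of file
            if not header.strip():
                continue  # skip stray blank lines
            # Assumption: sequence, '+' and quality each occupy one line.
            for _ in range(3):
                handle.readline()
            # Record name is the first word after the leading '@'.
            name = header[1:].split(None, 1)[0].decode("ascii")
            batch.append((name, pos))
            if len(batch) >= batch_size:
                # Batched inserts inside one open transaction.
                con.executemany("INSERT INTO offsets VALUES (?, ?)", batch)
                batch = []
    if batch:
        con.executemany("INSERT INTO offsets VALUES (?, ?)", batch)
    con.commit()
    return con


def fetch_record(fastq_path, con, name):
    """Look up the offset for `name` and return the raw four-line record."""
    row = con.execute(
        "SELECT pos FROM offsets WHERE name = ?", (name,)
    ).fetchone()
    if row is None:
        raise KeyError(name)
    with open(fastq_path, "rb") as handle:
        handle.seek(row[0])
        return b"".join(handle.readline() for _ in range(4)).decode("ascii")
```

In a sketch like this, the "create" cost is dominated by scanning the
file and inserting one row per record, and the "search" cost by one
indexed SELECT plus one seek per query.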

Also, I didn't see much difference between Berkeley DB and
Tokyo Cabinet, though the ctypes-based TC wrapper I was using has since
been streamlined.

Here's what I saw for 15+ million records with this script:
http://github.com/brentp/bio-playground/blob/master/fileindex/examples/bench.py

/opt/src/methylcode/data/s_1_sequence.txt
benchmarking fastq file with 15646356 records (62585424 lines)
performing 500000 random queries

screed
------
create: 704.764
search: 51.717

biopython-sqlite
----------------
create: 727.868
search: 92.947

fileindex
---------
create: 294.356
search: 53.701
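
To make the relative slowdowns explicit, the ratios can be worked out
directly from the timings reported above. The snippet below is only a
convenience calculation: the numbers are copied from the benchmark
output, and the dictionary and the `slowdown` helper are hypothetical
names introduced here.

```python
# Timings copied from the benchmark output above (units as reported by bench.py).
timings = {
    "screed": {"create": 704.764, "search": 51.717},
    "biopython-sqlite": {"create": 727.868, "search": 92.947},
    "fileindex": {"create": 294.356, "search": 53.701},
}


def slowdown(tool, baseline, phase):
    """Return how many times slower `tool` is than `baseline` for one phase."""
    return timings[tool][phase] / timings[baseline][phase]


for baseline in ("screed", "fileindex"):
    for phase in ("create", "search"):
        ratio = slowdown("biopython-sqlite", baseline, phase)
        print(f"biopython-sqlite vs {baseline}, {phase}: {ratio:.2f}x")
```

On these numbers the largest gap is index creation against fileindex,
at roughly 2.5x; the other three ratios are all below 2x.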