[Open-bio-l] OBDA redux?

Thu Nov 17 17:11:44 UTC 2011

On Thu, Nov 17, 2011 at 02:39:49PM +0000, Peter Cock wrote:
> > +1.  This will only get worse, with the projections for upcoming HiSeq
> > upgrades, it is possible 1-2 channel runs would hit that limit.
> 
> That's a useful scale to aim to cover in profiling then, 100M to 500M
> records. Jason, do you have any more details about the slowdown
> you found with SQLite? For this use case we want to write the index
> once, and read it many times. I found it is quicker to populate the
> offset table before creating the index - perhaps you were seeing the
> index being updated while adding records?

I have also found that SQLite performance deteriorates quickly when
it is hammered. Rather too quickly, in fact. Don't forget that SQL is
inherently slower than 'simple' indexers. Also, SQLite is a convenience
library rather than a library designed for optimized performance. We
used to run Sleepycat/BDB for that reason; now it is Tokyo/Kyoto
Cabinet.
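
For reference, the load-then-index ordering Peter describes could be
sketched roughly as below, using Python's standard sqlite3 module. The
table layout (key, offset, length) and the names build_offset_index and
lookup are my own assumptions for illustration, not anything from an
existing OBDA implementation.

```python
import sqlite3

def build_offset_index(records, db_path=":memory:"):
    """Bulk-load (key, offset, length) rows, then create the index once.

    Hypothetical schema: one row per sequence record, giving its byte
    offset and length in the flat file.
    """
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE offsets (key TEXT, offset INTEGER, length INTEGER)"
    )
    # Insert everything in a single transaction, with no index present,
    # so SQLite does not have to update a B-tree for every row.
    with con:
        con.executemany("INSERT INTO offsets VALUES (?, ?, ?)", records)
    # Only now build the index, in one pass over the populated table.
    con.execute("CREATE UNIQUE INDEX idx_key ON offsets (key)")
    con.commit()
    return con

def lookup(con, key):
    """Return (offset, length) for key, or None if the key is absent."""
    cur = con.execute(
        "SELECT offset, length FROM offsets WHERE key = ?", (key,)
    )
    return cur.fetchone()

# Tiny made-up data set: 1000 records of 100 bytes each.
demo = [("read%d" % i, i * 100, 100) for i in range(1000)]
con = build_offset_index(demo)
print(lookup(con, "read42"))
```

This says nothing about how it behaves at 100M to 500M rows; that is
exactly what would need profiling.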

In the (rather) near future we will be looking at parallel feeds from
multiple machines, to keep it somewhat interesting. Hadoop has
indexing support. In fact, Hadoop should be ideal for indexed sequence
information, though I have not used it. Still, when the time comes, I
am kinda interested in parallelized NoSQL solutions for scaling up.
Hadoop kills me because of its complexity. I hate complexity (one
reason I have tried to avoid SQL servers).

BTW, 500M records take significant RAM for an in-memory index. Quite a
number of solutions have to keep their indexes in memory to retain
their performance. 500M records now will grow to 500G records soon.
Just a thing to keep in mind. I would opt for a non-RAM solution.
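
As a back-of-envelope sketch of that RAM cost: the per-entry sizes
below (20-byte key, 8-byte offset, 16 bytes of container overhead) are
guesses of mine, not measurements, and index_ram_bytes is just a
hypothetical helper. With those guesses, 500M records already come to
about 22 GB.

```python
def index_ram_bytes(n_records, key_bytes=20, offset_bytes=8,
                    overhead_bytes=16):
    """Estimate RAM for an in-memory key -> offset index.

    All per-entry sizes are assumed values; real hash tables or trees
    may need considerably more (or less) per entry.
    """
    return n_records * (key_bytes + offset_bytes + overhead_bytes)

# 500M records now, 500G records later.
for n in (500_000_000, 500_000_000_000):
    gib = index_ram_bytes(n) / 2**30
    print("%d records: %.1f GiB" % (n, gib))
```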

Pj.