[Open-bio-l] OBDA redux?

Wed Nov 16 20:19:14 UTC 2011

Not to overly advocate for NoSQL, as I think the jury is still out for our
purposes, but I do think it is worth benchmarking: NoSQL and SQL-based
systems will have different overheads.

When I have tried to store 100M to 500M records in SQLite, performance
degraded, whereas I was able to store that range of keys in a NoSQL db
without problems.
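
As a starting point for that kind of benchmark, here is a minimal sketch of
the SQLite side, assuming an index that maps each record identifier to an
offset and length in the original sequence file, as discussed further down
the thread. The table name, column names and the fake_records() generator
are made up for illustration; none of this is an agreed OBDA v2 schema, and
the timings only mean something on your own data and hardware.

```python
import sqlite3
import time


def build_index(conn, records):
    """Create an identifier -> (offset, length) table and bulk-load it."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS offset_index ("
        " record_id TEXT PRIMARY KEY,"
        " file_offset INTEGER NOT NULL,"
        " record_length INTEGER NOT NULL)"
    )
    with conn:  # load the whole batch inside a single transaction
        conn.executemany("INSERT INTO offset_index VALUES (?, ?, ?)", records)


def lookup(conn, record_id):
    """Return (file_offset, record_length), or None if the id is unknown."""
    return conn.execute(
        "SELECT file_offset, record_length FROM offset_index"
        " WHERE record_id = ?",
        (record_id,),
    ).fetchone()


def fake_records(n, record_size=100):
    """Hypothetical stand-in for scanning a real sequence file."""
    for i in range(n):
        yield ("seq%09d" % i, i * record_size, record_size)


def time_sqlite(n):
    """Time one bulk load of n records and one lookup of the last key."""
    conn = sqlite3.connect(":memory:")
    start = time.perf_counter()
    build_index(conn, fake_records(n))
    build_seconds = time.perf_counter() - start
    start = time.perf_counter()
    hit = lookup(conn, "seq%09d" % (n - 1))
    lookup_seconds = time.perf_counter() - start
    conn.close()
    return build_seconds, lookup_seconds, hit


if __name__ == "__main__":
    build_s, lookup_s, hit = time_sqlite(100_000)
    print("build %.3fs, lookup %.6fs, last record %r" % (build_s, lookup_s, hit))
```

A real comparison would use an on-disk database, scale n up towards the
100M+ range, and run the same load and lookup pattern against whichever
key-value store is being considered.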

I don't know if there is a generic API for the NoSQL systems which would
help with standardization.
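
Failing a generic API, one option is a thin wrapper of our own. The sketch
below is only an illustration of that idea: two interchangeable backends
behind the same put/get/close methods, one on SQLite and one on Python's
standard dbm module standing in for a key-value store. The class names,
method names, table layout and the packed-integer value format are all
assumptions, not part of any existing OBDA specification.

```python
import dbm
import os
import sqlite3
import struct
import tempfile


class SqliteIndex:
    """Identifier -> (offset, length) index stored in an SQLite table."""

    def __init__(self, path):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS offset_index ("
            " record_id TEXT PRIMARY KEY,"
            " file_offset INTEGER, record_length INTEGER)"
        )

    def put(self, record_id, offset, length):
        with self.conn:  # commit each write
            self.conn.execute(
                "INSERT OR REPLACE INTO offset_index VALUES (?, ?, ?)",
                (record_id, offset, length),
            )

    def get(self, record_id):
        row = self.conn.execute(
            "SELECT file_offset, record_length FROM offset_index"
            " WHERE record_id = ?",
            (record_id,),
        ).fetchone()
        if row is None:
            raise KeyError(record_id)
        return row

    def close(self):
        self.conn.close()


class DbmIndex:
    """Same interface on top of a key-value store (stdlib dbm here)."""

    _PACK = struct.Struct("<QQ")  # two unsigned 64-bit ints: offset, length

    def __init__(self, path):
        self.db = dbm.open(path, "c")  # "c": create the file if needed

    def put(self, record_id, offset, length):
        self.db[record_id.encode("utf-8")] = self._PACK.pack(offset, length)

    def get(self, record_id):
        # dbm raises KeyError itself for an unknown key
        return self._PACK.unpack(self.db[record_id.encode("utf-8")])

    def close(self):
        self.db.close()


def demo():
    """Store and fetch the same entries through both backends."""
    results = []
    with tempfile.TemporaryDirectory() as tmp:
        for cls, name in ((SqliteIndex, "idx.sqlite"), (DbmIndex, "idx.dbm")):
            index = cls(os.path.join(tmp, name))
            index.put("seq1", 0, 120)
            index.put("seq2", 120, 85)
            results.append(index.get("seq2"))
            index.close()
    return results


if __name__ == "__main__":
    print(demo())
```

With something like this, the benchmark code and the parsers only ever see
put() and get(), so swapping the backend later would not touch them.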

Jason Stajich
jason at bioperl.org

On Mon, Nov 14, 2011 at 1:47 PM, Fields, Christopher J <
cjfields at illinois.edu> wrote:

> On Nov 14, 2011, at 12:14 PM, Peter Cock wrote:
>
> > Hi Chris,
> >
> > [Did you mean to CC BioPerl-l? Should I have?]
> >
> > On Mon, Nov 14, 2011 at 5:59 PM, Fields, Christopher J
> > <cjfields at illinois.edu> wrote:
> >> On Nov 13, 2011, at 6:24 AM, Peter Cock wrote:
> >>
> >>> So, Chris and I seem in general agreement that an OBDA v2
> >>> using SQLite but based on essentially the same approach as
> >>> the BDB or flat file based OBDA v1 is a good idea. i.e. Tables
> >>> mapping record identifiers to file offsets in the original sequence
> >>> files.
> >>
> >> The worry I have is adhering to a specific backend (e.g. SQLite).
> >> The reason I say this is b/c BDB in its time was the go-to way
> >> of storing simple index data, but that is no longer feasible for
> >> very large data sets.  Who's to say something similar won't
> >> happen to SQLite, or that it is the best option available?
> >
> > Right now I would think SQLite is one of the best (if not the
> > best) option. If supporting the old back ends is important for
> > cross-project compatibility, I'm willing to have another go
> > at using BDB in Biopython, but had limited success last
> > time I tried.
>
> No, I agree re: SQLite at the moment, it's probably the best option (fast,
> widely adopted, etc), though Jason mentioned (Tokyo|Kyoto)Cabinet also
> worked very well.  I would rather not paint ourselves into a corner if the
> 'nice-and-shiny' next thing down the road performs better and gains wide
> adoption.
>
> >> Maybe we should focus on the data storage schema, as
> >> simple as it may be, then indicate the default backend
> >> must be SQLite but others are allowed (maybe with a
> >> mention that SQLite can be replaced by alternatives in
> >> the future if needed).
> >
> > It would make sense to talk about an SQL schema if
> > the "other options" would also be SQL based. But they
> > might not be... but certainly we should keep potential
> > alternative back ends in mind.
>
> It's probably necessary to allow for both possibilities (SQL and other).
>  For instance, a move to SQLite will necessitate describing the table data
> with SQL anyway.
>
> >>> I hope to get BioRuby on board; they already have OBDA
> >>> v1 support, so that shouldn't be too hard.
> >>>
> >>> Right now I don't recall if BioJava has/had OBDA v1 support,
> >>> and if they did if it was affected in their recent move to BioJava
> >>> v3 (I understand from their mailing list that some older lower
> >>> priority functionality has not all been ported yet).
> >>
> >> I wouldn't be surprised at that, OBDA kind of lingered for a
> >> while, and I'm not sure how widely adopted it became
> >> (maybe others can shed light on that?)
> >
> > Well, OBDA went beyond just indexing flat files - it also
> > tried to standardize things like remote database access.
> > I don't think we ever really had that side working in
> > Biopython, so I am less familiar with it. I know EMBOSS
> > has something fairly extensive for online databases,
> > but have not checked if it uses the OBDA style or their
> > own.
>
> Right, but I wonder if that may have been one problem with the original
> OBDA specification, that it was perhaps overly ambitious out-the-gate.
>
> > For now I was only planning to tackle indexing sequence
> > files in this "OBDA redux".
>
> That's a good and simpler start; the rest (remote access) can fall in
> naturally once that is in place.
>
> >>> Also EMBOSS are likely to be interested, certainly Peter Rice
> >>> was interested in the SQLite indexing we're already using in
> >>> Biopython for sequence files (i.e. what is effectively the
> >>> prototype for OBDA v2).
> >>>
> >>> Note that in addition to simple indexing of text files, we are
> >>> already using the same simple offset + length approach for
> >>> indexing binary files (e.g. SFF).
> >>
> >> I think that's the general idea, that is how all bioperl data
> >> was indexed, before with the Bio::Index modules and with
> >> the OBDA implementations as well.
> >
> > Good.
> >
> >>> On the immediate practical side, I think I can edit the
> >>> current OBDA website of http://obda.open-bio.org/
> >>> via /home/websites/obda.open-bio.org/html on the
> >>> server.
> >>
> >> See below w/ regards to my thoughts on the wiki.
> >>
> >>> We need to work out where the current OBDA indexing
> >>> specification lives (CVS or SVN?) and perhaps move
> >>> that to github. We may need a general OBF organisation
> >>> account on github for this and any other cross-project
> >>> repositories.
> >>
> >> +1 to a move to github, but maybe this belongs in an
> >> OBF-specific organization.
> >
> > Yes, definitely under an OBF github account (not under
> > Biopython, BioPerl, etc).
> >
> >> And maybe we should take advantage of the simple
> >> wiki or project homepage that GitHub offers and move
> >> everything (docs) there.
> >
> > That could work. We'd have to go through all the old
> > documentation and relocate it, then we could make the
> > obda.open-bio.org domain point at the github pages.
>
> Yes, I think that's the idea.
>
> >>> I see there is already an OBDA project on RedMine,
> >>> (Chris can you add me to that please?)
> >>> https://redmine.open-bio.org/projects/obda
> >>>
> >>> Peter
> >>
> >> Done (last night actually, but I didn't have time to respond
> >> immediately).
> >>
> >> chris
> >
> > Thanks,
> >
> > Peter
>
> np.
>
> -c
>
>