[Open-bio-l] OBDA redux?

Thu Nov 17 00:00:50 UTC 2011

Hi Jason,

I was not actively following this thread but have one comment:

> I don't know if there is a generic API for the NOSQL systems which would
> help for standardization.

To my knowledge, RDF/SPARQL is the only standardized format/protocol
among the NoSQL stores. Unfortunately, its performance and scalability
are not yet comparable to those of the widely used key-value stores
(e.g. Tokyo Cabinet). However, the Semantic Web has the potential to
become a standard for storing heterogeneous data sets as an integrated
biological DB without designing any schema (we need ontologies instead).

Cheers,
Toshiaki Katayama

On 2011/11/17, at 5:19, Jason Stajich wrote:

> Not to overly advocate for NOSQL, as I think for our purposes the jury
> is still out. So I think it is worth benchmarking - NOSQL and SQL-based
> systems will have different overheads.
> 
> I know that when I have tried to store 100M to 500M records in SQLite the
> performance degraded, whereas I was able to store that range of keys in
> a NOSQL db without problems.
> 
> I don't know if there is a generic API for the NOSQL systems which would
> help for standardization.
> 
> Jason Stajich
> jason at bioperl.org
> 
> 
> On Mon, Nov 14, 2011 at 1:47 PM, Fields, Christopher J <
> cjfields at illinois.edu> wrote:
> 
>> On Nov 14, 2011, at 12:14 PM, Peter Cock wrote:
>> 
>>> Hi Chris,
>>> 
>>> [Did you mean to CC BioPerl-l? Should I have?]
>>> 
>>> On Mon, Nov 14, 2011 at 5:59 PM, Fields, Christopher J
>>> <cjfields at illinois.edu> wrote:
>>>> On Nov 13, 2011, at 6:24 AM, Peter Cock wrote:
>>>> 
>>>>> So, Chris and I seem in general agreement that an OBDA v2
>>>>> using SQLite but based on essentially the same approach as
>>>>> the BDB or flat file based OBDA v1 is a good idea. i.e. Tables
>>>>> mapping record identifiers to file offsets in the original sequence
>>>>> files.
>>>> 
>>>> The worry I have is adhering to a specific backend (e.g. SQLite).
>>>> The reason I say this is b/c BDB in its time was the go-to way
>>>> of storing simple index data, but that is no longer feasible for
>>>> very large data sets.  Who's to say something similar won't
>>>> happen to SQLite, or that it is the best option available?
>>> 
>>> Right now I would think SQLite is one of the best (if not the
>>> best) option. If supporting the old back ends is important for
>>> cross-project compatibility, I'm willing to have another go
>>> at using BDB in Biopython, but had limited success last
>>> time I tried.
>> 
>> No, I agree re: SQLite at the moment, it's probably the best option (fast,
>> widely adopted, etc), though Jason mentioned (Tokyo|Kyoto)Cabinet also
>> worked very well.  I would rather not paint ourselves into a corner if the
>> 'nice-and-shiny' next thing down the road performs better and gains wide
>> adoption.
>> 
>>>> Maybe we should focus on the data storage schema, as
>>>> simple as it may be, then indicate the default backend
>>>> must be SQLite but others are allowed (maybe with a
>>>> mention that SQLite can be replaced by alternatives in
>>>> the future if needed).
>>> 
>>> It would make sense to talk about an SQL schema if
>>> the "other options" would also be SQL based. But they
>>> might not be... but certainly we should keep potential
>>> alternative back ends in mind.
>> 
>> It's probably necessary to allow for both possibilities (SQL and other).
>> For instance, a move to SQLite will necessitate describing the table data
>> with SQL anyway.
>> 
>>>>> I hope to get BioRuby on board; they already have OBDA
>>>>> v1 support, so that shouldn't be too hard.
>>>>> 
>>>>> Right now I don't recall if BioJava has/had OBDA v1 support,
>>>>> and if they did, whether it was affected by their recent move to BioJava
>>>>> v3 (I understand from their mailing list that some older lower
>>>>> priority functionality has not all been ported yet).
>>>> 
>>>> I wouldn't be surprised at that, OBDA kind of lingered for a
>>>> while, and I'm not sure how widely adopted it became
>>>> (maybe others can shed light on that?)
>>> 
>>> Well, OBDA went beyond just indexing flat files - it also
>>> tried to standardize things like remote database access.
>>> I don't think we ever really had that side working in
>>> Biopython, so I am less familiar with it. I know EMBOSS
>>> has something fairly extensive for online databases,
>>> but have not checked if it uses the OBDA style or their
>>> own.
>> 
>> Right, but I wonder if that may have been one problem with the original
>> OBDA specification, that it was perhaps overly ambitious out of the gate.
>> 
>>> For now I was only planning to tackle indexing sequence
>>> files in this "OBDA redux".
>> 
>> That's a good and simpler start; the rest (remote access) should follow
>> naturally once that is in place.
>> 
>>>>> Also EMBOSS are likely to be interested, certainly Peter Rice
>>>>> was interested in the SQLite indexing we're already using in
>>>>> Biopython for sequence files (i.e. what is effectively the
>>>>> prototype for OBDA v2).
>>>>> 
>>>>> Note that in addition to simple indexing of text files, we are
>>>>> already using the same simple offset + length approach for
>>>>> indexing binary files (e.g. SFF).
>>>> 
>>>> I think that's the general idea; that is how all BioPerl data
>>>> was indexed, previously with the Bio::Index modules and with
>>>> the OBDA implementations as well.
>>> 
>>> Good.
>>> 
>>>>> On the immediate practical side, I think I can edit the
>>>>> current OBDA website of http://obda.open-bio.org/
>>>>> via /home/websites/obda.open-bio.org/html on the
>>>>> server.
>>>> 
>>>> See below w/ regards to my thoughts on the wiki.
>>>> 
>>>>> We need to work out where the current OBDA indexing
>>>>> specification lives (CVS or SVN?) and perhaps move
>>>>> that to github. We may need a general OBF organisation
>>>>> account on GitHub for this and any other cross-project
>>>>> repositories.
>>>> 
>>>> +1 to a move to github, but maybe this belongs in an
>>>> OBF-specific organization.
>>> 
>>> Yes, definitely under an OBF github account (not under
>>> Biopython, BioPerl, etc).
>>> 
>>>> And maybe we should take advantage of the simple
>>>> wiki or project homepage that GitHub offers and move
>>>> everything (docs) there.
>>> 
>>> That could work. We'd have to go through all the old
>>> documentation and relocate it, then we could make the
>>> obda.open-bio.org domain point at the github pages.
>> 
>> Yes, I think that's the idea.
>> 
>>>>> I see there is already an OBDA project on RedMine,
>>>>> (Chris can you add me to that please?)
>>>>> https://redmine.open-bio.org/projects/obda
>>>>> 
>>>>> Peter
>>>> 
>>>> Done (last night actually, but I didn't have time to respond
>>>> immediately).
>>>> 
>>>> chris
>>> 
>>> Thanks,
>>> 
>>> Peter
>> 
>> np.
>> 
>> -c
>> 
>> 
> _______________________________________________
> Open-Bio-l mailing list
> Open-Bio-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/open-bio-l
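
The indexing approach discussed in the quoted thread (an SQLite table
mapping record identifiers to a file offset and length in the original
sequence file) could look roughly like the sketch below. This is only an
illustration under stated assumptions: the table name, column names and
function names are hypothetical, and it is neither the OBDA specification
nor the index layout that Biopython actually uses. It assumes a well-formed
FASTA file in which every header line has an identifier after the ">".

```python
import os
import sqlite3
import tempfile


def build_index(fasta_path, db_path):
    """Store (identifier, byte offset, byte length) for each FASTA record."""
    con = sqlite3.connect(db_path)
    # Hypothetical schema, chosen for illustration only.
    con.execute(
        "CREATE TABLE IF NOT EXISTS record_index ("
        "record_id TEXT PRIMARY KEY, "
        "byte_offset INTEGER, "
        "byte_length INTEGER)"
    )
    rows = []
    record_id, start, pos = None, 0, 0
    # Binary mode so that positions are true byte offsets.
    with open(fasta_path, "rb") as handle:
        for line in handle:
            if line.startswith(b">"):
                if record_id is not None:
                    rows.append((record_id, start, pos - start))
                # Identifier = first word of the header line (assumed present).
                record_id = line[1:].split()[0].decode("ascii")
                start = pos
            pos += len(line)
    if record_id is not None:
        rows.append((record_id, start, pos - start))
    with con:  # commit all rows in one transaction
        con.executemany("INSERT INTO record_index VALUES (?, ?, ?)", rows)
    return con


def fetch_raw(con, fasta_path, record_id):
    """Return the raw bytes of one record by seeking into the original file."""
    row = con.execute(
        "SELECT byte_offset, byte_length FROM record_index WHERE record_id = ?",
        (record_id,),
    ).fetchone()
    if row is None:
        raise KeyError(record_id)
    offset, length = row
    with open(fasta_path, "rb") as handle:
        handle.seek(offset)
        return handle.read(length)


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    fasta = os.path.join(workdir, "example.fasta")
    with open(fasta, "wb") as out:
        out.write(b">seq1 first\nACGT\nAC\n>seq2\nGGCC\n")
    index = build_index(fasta, os.path.join(workdir, "example.idx"))
    print(fetch_raw(index, fasta, "seq2").decode("ascii"))
```

Because the index only stores offsets and lengths, the same idea carries
over to binary formats, and the SQLite table could in principle be swapped
for another key-value backend without touching the sequence files.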