[Bioperl-l] OBDA redux?

Thu Nov 3 19:47:51 UTC 2011

On Nov 3, 2011, at 1:52 PM, Peter Cock wrote:

> On Thu, Nov 3, 2011 at 6:28 PM, Fields, Christopher J
> <cjfields at illinois.edu> wrote:
>> (side thread, so re-titling...)
>> 
> And CC'ing open-bio-l, which is a better home for this than bioperl-l,
> where OBDA v2 talk came up again in discussion of a BioPerl indexing
> problem. Archive links for thread here:
> 
> http://lists.open-bio.org/pipermail/bioperl-l/2011-November/035807.html
> http://lists.open-bio.org/pipermail/bioperl-l/2011-November/035808.html
> http://lists.open-bio.org/pipermail/bioperl-l/2011-November/035811.html
> http://lists.open-bio.org/pipermail/bioperl-l/2011-November/035812.html
> http://lists.open-bio.org/pipermail/bioperl-l/2011-November/035813.html
> http://lists.open-bio.org/pipermail/bioperl-l/2011-November/035822.html

yes, good idea...

>> On Nov 1, 2011, at 1:06 PM, Peter Cock wrote:
>>> 
>>> Yes, we're using SQLite3 to store essentially a list of filenames
>>> and their format as one table, and then in the main table an
>>> entry for each sequence recording the ID (only one accession,
>>> unlike OBDA which had infrastructure for a secondary accession),
>>> file number, offset of the start of the record, and optionally the
>>> length of the record on disk.
>>> 
>>> i.e. Basically what OBDA does, but using SQLite rather
>>> than BDB (not included in Python 3) or a flat file index
>>> (poor performance with large datasets).
>>> 
>>> I find this design attractive on several levels:
>>> * File format neutral, covers FASTA, FASTQ, GenBank, etc
>>> * Preserves the original file untouched
>>> * Index is a small single file (thanks to SQLite)
>>> * Back end could be switched out
>>> * Could be applied to compressed file formats
>>> * Reuses existing parsing code to access entries
>>> 
>>> This could easily form the basis of OBDA v2. The main points
>>> of difference I anticipate between the Bio* projects would
>>> be naming conventions for the different file formats, and
>>> what we consider to be the default record ID of each read
>>> (e.g. which field in a GenBank file - although agreement
>>> here is not essential). Some of that was already settled in
>>> principle with OBDA v1.
>> 
>> The primary/secondary IDs could be configurable with a sane
>> default. I think the bioperl implementations allowed this (and
>> it is certainly something that will be requested).
> 
> One reason I went with a single ID only was to keep the
> Python dictionary based API simple (think hash in Perl).
> You don't get secondary keys in a Python dict or a hash ;)
> 
> As a nod to flexibility, in Biopython's Bio.SeqIO indexing you
> can provide a callback function to map the suggested ID to
> something else. Obviously this doesn't give the full flexibility
> of extracting a field from the record's annotation because we
> don't parse the whole record during indexing (it would be too
> slow).

Same with bioperl.
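
To make the callback idea concrete, here is a minimal sketch of offset-based
indexing with an ID-mapping hook. It is plain Python written for this
thread, not Biopython's or BioPerl's actual implementation; the names
index_fasta and fetch_raw are hypothetical, and it assumes a simple FASTA
file. (In Biopython the equivalent hook is the key_function argument to
Bio.SeqIO.index.) Only the header line is inspected, so the callback sees
the suggested ID and nothing from the rest of the record.

```python
def index_fasta(path, key_function=None):
    """Return {record_id: (offset, length)} for a FASTA file.

    Illustrative sketch: only header lines are examined, so the optional
    key_function receives the suggested ID (first word after '>') and
    cannot see any other part of the record.
    """
    index = {}
    current_id = None
    start = 0
    with open(path, "rb") as handle:
        while True:
            pos = handle.tell()
            line = handle.readline()
            if not line or line.startswith(b">"):
                # A new header (or end of file) closes the previous record.
                if current_id is not None:
                    if current_id in index:
                        raise ValueError("Duplicate key %r" % current_id)
                    index[current_id] = (start, pos - start)
                if not line:
                    break
                fields = line[1:].split(None, 1)
                suggested = fields[0].decode("ascii") if fields else ""
                current_id = key_function(suggested) if key_function else suggested
                start = pos
    return index


def fetch_raw(path, index, key):
    """Read one record's raw bytes from the untouched original file."""
    offset, length = index[key]
    with open(path, "rb") as handle:
        handle.seek(offset)
        return handle.read(length)
```

For example, key_function=lambda s: s.split("|")[3] would turn a suggested
ID like gi|1|ref|NM_1| into NM_1. The raw bytes from fetch_raw can then be
handed to whatever parser the project already has.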

> However, I'm happy for there to be an *optional* secondary
> key in an OBDA v2 SQLite schema, but Biopython probably
> won't populate it. We could optionally use it rather than the
> primary ID on loading an existing index though.

Optional implementation of that is fine by me.
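
As a rough illustration of what such a schema might look like, here is a
sketch using Python's sqlite3 module. All table and column names
(file_data, offset_data, record_id, secondary_id, and so on) are
assumptions made up for this example, not an agreed OBDA v2 layout and not
necessarily what Biopython uses. The secondary ID column is nullable, so an
implementation is free to leave it empty.

```python
import sqlite3

# Hypothetical schema: names are illustrative, not a settled OBDA v2 design.
SCHEMA = """
CREATE TABLE file_data (
    file_number INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,      -- path to the original sequence file
    format      TEXT NOT NULL       -- e.g. 'fasta', 'genbank'
);
CREATE TABLE offset_data (
    record_id     TEXT PRIMARY KEY, -- the single primary ID
    secondary_id  TEXT,             -- optional, may be left NULL
    file_number   INTEGER NOT NULL REFERENCES file_data(file_number),
    record_offset INTEGER NOT NULL, -- byte offset of the record start
    record_length INTEGER           -- optional length on disk
);
CREATE INDEX idx_secondary ON offset_data(secondary_id);
"""


def create_index(path=":memory:"):
    """Create an empty index database and return the connection."""
    con = sqlite3.connect(path)
    con.executescript(SCHEMA)
    return con


def lookup(con, record_id):
    """Return (filename, format, offset, length) for an ID, or None."""
    return con.execute(
        "SELECT f.name, f.format, o.record_offset, o.record_length "
        "FROM offset_data AS o JOIN file_data AS f USING (file_number) "
        "WHERE o.record_id = ?",
        (record_id,),
    ).fetchone()
```

A reader that prefers the secondary key on loading an existing index would
only need to change the WHERE clause to match on secondary_id instead.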

> Personally I would stick with one key in the index - it should
> be faster and makes it simpler to switch the back end if we
> need to later. If anyone wants a second key, they can build
> a second index *grin*.

That's easy enough.

>>> On the other hand, you could try and store the parsed data
>>> itself, which is where NoSQL looks more interesting. That
>>> essentially requires the ability to serialise your annotated
>>> sequence object model to disk - which would be tricky to do
>>> cross project (much more ambitious than BioSQL is). It also
>>> means the "index" becomes very large because it now holds
>>> all the original data.
>>> 
>>> Peter
>> 
>> For a fully cross-Bio* compliant format, I don't think it's feasible
>> to use serialized data unless they are serialized in something
>> that is easily deserialized across HLLs (JSON, BSON, YAML,
>> XML, etc).  Either that, or such data is stored concurrently with
>> the binary blob, along with meta data that indicates the source
>> of the blob, parser, version, etc, etc (unless there are tools out
>> there that reliably interconvert serialized complex data structures
>> between HLLs).  Any way you go about it, it seems like it could
>> be a major ball of hurt, unless implemented very carefully.
> 
> You missed out RDF as a serialisation ;)
> 
> But yes, going down the shared serialisation route is going
> to be messy - as you are well aware:
> 
>> Aside: I think this was one of the problems with
>> Bio::DB::SeqFeature::Store, in that it at one point stored
>> Perl-specific Storable blobs.
>> 
>> chris
> 
> Peter

Yes, it's a problem without an easy solution.  Anyway, I think implementing shared serialisation at this point would be premature optimization.

chris