[GMOD-devel] Re: [Open-bio-l] Schema for genes & features & mappings to assemblies

Lincoln Stein lstein@cshl.org
Thu, 25 Apr 2002 14:47:43 -0400


On Tuesday 23 April 2002 11:24, Matthew Pocock wrote:
> Lincoln Stein wrote:
> > Chiming in here:
> >>>>We do need to discuss assemblies. I vote for "flat" one level
> >>>>assemblies
> >
> > I agree.
>
> I would strongly prefer arbitrary depth assemblies. It is the general
> case solution, and BioSQL is not to my mind the appropriate place to
> special-case data-models (do that in Ensembl or some other
> task-dedicated schema). The BioJava object model supports arbitrary
> depth assemblies, so if we are to use BioSQL to persist BioJava
> sequences, BioSQL must support them.
>
> We seem to be coming back to the same modularity issue - if one camp is
> decided upon single-depth assemblies and another camp believes that
> arbitrary depth assemblies are a requirement then could we not just put
> assembly logic into a separate chunk of SQL and adapters? We all still
> get to re-use the core tables as-is.
>
> If we had a multi-level-capable schema, how hard is it to execute a
> query at start-up that checks if a depth > 1 is present anywhere? You
> can probably special case this as a self-join on the assembly table,
> group and count - then load in single-level optimized adapters if the
> count is zero.
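
The start-up check described above can be sketched roughly as follows. This is only an illustration, not the BioSQL schema: the `assembly` table with `parent_id` and `child_id` columns is a hypothetical layout, and SQLite stands in for whatever database is actually used. The self-join counts components that are themselves assembled from further parts; a count of zero would mean every assembly is single-level.

```python
import sqlite3


def has_nested_assembly(conn):
    """Return True if any assembly component is itself assembled from parts.

    Assumes a hypothetical table assembly(parent_id, child_id); the
    self-join matches rows whose child appears as a parent elsewhere.
    """
    row = conn.execute(
        """
        SELECT COUNT(*)
        FROM assembly AS outer_level
        JOIN assembly AS inner_level
          ON inner_level.parent_id = outer_level.child_id
        """
    ).fetchone()
    return row[0] > 0


def make_db(rows):
    """Build an in-memory database holding the given (parent, child) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE assembly (parent_id TEXT, child_id TEXT)")
    conn.executemany("INSERT INTO assembly VALUES (?, ?)", rows)
    return conn


# One-level example: a chromosome built directly from contigs.
flat = make_db([("chr1", "contigA"), ("chr1", "contigB")])
# Two-level example: chromosome -> scaffold -> contig.
nested = make_db([("chr1", "scaffold1"), ("scaffold1", "contigA")])
```

An adapter layer could call `has_nested_assembly` once at start-up and pick single-level or multi-level code paths accordingly.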

Actually, the choice isn't as stark as this.  I maintain that you can 
linearize an assembly into a 1-level structure for the purposes of storage 
and query efficiency, and then hydrate it into an n-level structure at query 
time for use with BioJava and other clients that want it.  The main issue is 
being able to perform coordinate translations with acceptable performance so 
that when the assembly changes you can carry the annotations over from one 
version to another.
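
As a rough sketch of the idea in the paragraph above (not code from any of the projects mentioned), an n-level assembly can be flattened into one row per leaf sequence with its absolute start on the top-level sequence, and the nesting can be rebuilt later if each row remembers its ancestor path. The `Node` class, the function names, and the simplifications (0-based offsets, forward strand only, no gaps or overlaps handled) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    length: int
    offset: int = 0  # 0-based start within the parent (assumed convention)
    children: list = field(default_factory=list)


def flatten(node, base=0, path=()):
    """Yield (leaf_name, absolute_start, length, ancestor_path) rows."""
    start = base + node.offset
    if not node.children:
        yield (node.name, start, node.length, path)
        return
    for child in node.children:
        yield from flatten(child, start, path + (node.name,))


def to_top_level(rows, leaf_name, pos):
    """Translate a position on a leaf into top-level coordinates."""
    for name, start, length, _ in rows:
        if name == leaf_name and 0 <= pos < length:
            return start + pos
    raise KeyError(leaf_name)


def hydrate(rows):
    """Rebuild the nesting as dicts; leaves map to (absolute_start, length)."""
    tree = {}
    for name, start, length, path in rows:
        level = tree
        for ancestor in path:
            level = level.setdefault(ancestor, {})
        level[name] = (start, length)
    return tree


# Hypothetical three-level assembly: chr1 -> scaffold1 -> contigs.
chr1 = Node("chr1", 1000, children=[
    Node("scaffold1", 500, offset=100, children=[
        Node("contigA", 200, offset=0),
        Node("contigB", 250, offset=250),
    ]),
    Node("contigC", 300, offset=700),
])
rows = list(flatten(chr1))
```

With the rows stored flat, a coordinate translation is a single lookup rather than a walk down the hierarchy, while `hydrate(rows)` recovers the nested view for clients that want it.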

I guess it comes down to what the analysis pipeline looks like.  If the 
pipeline is "freeze-assembly-then-annotate", then a flat assembly is very 
efficient.  If the pipeline is 
"assemble-and-annotate-simultaneously" then there are big benefits to 
maintaining the storage in its n-level state.  So far, all the pipelines I've 
seen have followed the first route.  This is true for both UCSC (which uses 
completely flat storage) and EnsEMBL (which uses 2-level + stickies).

Lincoln