[GMOD-devel] Re: [Open-bio-l] Schema for genes & features & mappings to assemblies

Chris Mungall cjm@bdgp.lbl.gov
Wed, 24 Apr 2002 18:43:42 -0700 (PDT)


On Tue, 23 Apr 2002, Lincoln Stein wrote:

> On Tuesday 23 April 2002 07:07, Elia Stupka wrote:
> > > Do you really want to special-case gene structures?  I thought
> >
> > Hmm... I agree with you, I like that, I guess then what we need to work on
> > is the clever code that would drive it. Coincidentally we are just
> > discussing super-non-hierarchical features for our comparative analysis
> > db, so we might end up coding this, if we want it all to run outside
> > ensembl on the bioperl-pipeline.
> >
> > Elia
>
> The way I took with Bio::DB::GFF is the following:
>
> 	- all features are stored as tag/values in a single table (normalized for
> 		tag names)
>
> 	- a series of "aggregator" classes are responsible for taking certain
> 	sets of tags and constructing rich objects from them.  For example, the
> 	Bio::DB::GFF::Aggregator::transcript class looks for tags named
> 	"exon", "cds", "polyA-site" and so forth and uses them to construct a
> 	transcript object.
>
> 	- you can create your own aggregators on the fly using an aggregatorFactory,
> 	or use "static" aggregators stored in .pm files.
>
> I think this is similar to Jason's recent Builder interface.  The strategy
> has pluses and minuses.  The plus is that you don't have to futz with the
> schema every time you want to add a new component to your gene.  The minus is
> that it's easy for the database to drift -- no referential integrity.
> There's also a whiff of the AceDB "magic tag" syndrome here.

We could use Michael Ashburner's SO (Sequence Ontology) feature type
ontology to define the aggregators, e.g.:

(partOf exon transcript)
(partOf cds  transcript)
(partOf polyA-site transcript)

This fits into the existing biosql seqfeature, seqfeature_relationship,
ontology_term and ontology_relationship tables rather nicely, so you get
the flexibility (a stable relational schema that doesn't change when you
add a new feature type) plus the integrity and drift control.
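
To make the idea concrete, here is a minimal sketch of an aggregator driven
by the partOf statements above. It is in Python purely for brevity; the
names (PART_OF, aggregate, the dict keys 'id', 'type', 'parent') are made up
for illustration and are not Bio::DB::GFF or biosql API.

```python
from collections import defaultdict

# (child_type, parent_type) pairs, mirroring the partOf statements above.
PART_OF = {
    ("exon", "transcript"),
    ("cds", "transcript"),
    ("polyA-site", "transcript"),
}


def aggregate(features, parent_type, part_of=PART_OF):
    """Group flat features under parents of parent_type, guided by part_of.

    Each feature is a dict with hypothetical keys 'id', 'type' and 'parent'.
    """
    # Child types the ontology allows as parts of parent_type.
    allowed = {child for child, parent in part_of if parent == parent_type}
    parents = {
        f["id"]: {"feature": f, "parts": defaultdict(list)}
        for f in features
        if f["type"] == parent_type
    }
    for f in features:
        if f["type"] in allowed and f.get("parent") in parents:
            parents[f["parent"]]["parts"][f["type"]].append(f["id"])
    return parents


features = [
    {"id": "t1", "type": "transcript", "parent": None},
    {"id": "e1", "type": "exon", "parent": "t1"},
    {"id": "e2", "type": "exon", "parent": "t1"},
    {"id": "c1", "type": "cds", "parent": "t1"},
    # Not declared partOf transcript, so the aggregator ignores it.
    {"id": "r1", "type": "repeat", "parent": "t1"},
]
transcripts = aggregate(features, "transcript")
```

Adding a new component to a transcript then means adding one pair to
PART_OF (one row in the ontology tables), with no change to the aggregator
code or the schema.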

The integrity could be enforced either directly, through some (admittedly
rather complex and possibly slow) generic SQL (though not in MySQL), or
through the application layer.
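
As a rough sketch of the application-layer option (Python again, with
hypothetical names and in-memory tuples standing in for the seqfeature and
seqfeature_relationship rows), the check is just "does every stored
child/parent pair correspond to a partOf statement in the ontology":

```python
# (child_type, parent_type) pairs the ontology declares as partOf.
PART_OF = {
    ("exon", "transcript"),
    ("cds", "transcript"),
    ("polyA-site", "transcript"),
}


def integrity_violations(features, relationships, part_of=PART_OF):
    """Return (child_id, parent_id) pairs whose types are not related by partOf.

    features:      iterable of (feature_id, type_name)
    relationships: iterable of (child_id, parent_id)
    """
    type_of = dict(features)
    return [
        (child, parent)
        for child, parent in relationships
        if (type_of.get(child), type_of.get(parent)) not in part_of
    ]


features = [("t1", "transcript"), ("e1", "exon"), ("g1", "gene")]
relationships = [
    ("e1", "t1"),  # exon partOf transcript: fine
    ("t1", "e1"),  # transcript partOf exon: not in the ontology
    ("e1", "g1"),  # exon partOf gene: not declared above
]
bad = integrity_violations(features, relationships)
```

Run at load time this rejects drifting data up front; run periodically it
reports drift after the fact.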

And you could also automatically generate views for gene, transcript, exon
etc relations plus their linking tables if you like that sort of thing.
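
A sketch of what the view generation might look like, using Python's
sqlite3 module so it runs standalone. The two tables here are cut-down
stand-ins with assumed column names (type_term_id, term_name), not the real
biosql definitions, and the linking-table views are left out:

```python
import sqlite3


def view_sql(type_name):
    """Build a CREATE VIEW statement for one feature type.

    Views cannot take bound parameters, so the name is interpolated;
    this assumes type names come from a trusted ontology table.
    """
    view = type_name.replace("-", "_")
    return (
        f'CREATE VIEW "{view}" AS '
        "SELECT sf.seqfeature_id FROM seqfeature sf "
        "JOIN ontology_term t ON t.ontology_term_id = sf.type_term_id "
        f"WHERE t.term_name = '{type_name}'"
    )


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ontology_term (ontology_term_id INTEGER PRIMARY KEY, term_name TEXT);
CREATE TABLE seqfeature (seqfeature_id INTEGER PRIMARY KEY, type_term_id INTEGER);
INSERT INTO ontology_term VALUES (1, 'transcript'), (2, 'exon');
INSERT INTO seqfeature VALUES (10, 1), (11, 2), (12, 2);
""")

# One view per feature type found in the ontology table.
for (name,) in conn.execute("SELECT term_name FROM ontology_term").fetchall():
    conn.execute(view_sql(name))

exon_ids = [
    row[0]
    for row in conn.execute("SELECT seqfeature_id FROM exon ORDER BY seqfeature_id")
]
```

People who prefer a gene/transcript/exon style schema can query the views,
while the underlying tables stay generic.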

Sorry if you've heard my ontology eulogising before, but this is a
different mailing list (or lists), so I'm sure there are a few who haven't
heard it.

> Lincoln