[BioSQL-l] Seqfeature_Source

Thomas Down td2@sanger.ac.uk
Mon, 23 Sep 2002 16:32:42 +0100


On Mon, Sep 23, 2002 at 12:42:25AM -0700, Hilmar Lapp wrote:
> >
> >How do you plan to do this?  I can think of three possibilities:
> >
> >  - Have a standard tag for seqfeature_source, and then put
> >    the source value (as a string) in the current
> >    seqfeature_qualifier_value table.  I don't have any particular
> >    objections to this, but it's got the same problem as putting
> >    the source as a text attribute in the main seqfeature
> >    table: it leaves the source as an opaque string.
> 
> Why is the string in seqfeature_source so different from this?

It's normalized.  Multiple features can point to the same
record in seqfeature_source.  Potentially, additional
information could be joined onto the seqfeature_source table
without having to replicate it for every feature with a given
source.
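
As a rough illustration of the normalization argument, here is a minimal sketch using Python's sqlite3 module with an in-memory database. The table and column names are simplified assumptions for this example, not the actual BioSQL DDL, and the helper source_id() is hypothetical.

```python
import sqlite3

# Simplified, hypothetical schema for illustration; not the real BioSQL DDL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE seqfeature_source (
        seqfeature_source_id INTEGER PRIMARY KEY,
        source_name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE seqfeature (
        seqfeature_id INTEGER PRIMARY KEY,
        seqfeature_source_id INTEGER NOT NULL
            REFERENCES seqfeature_source (seqfeature_source_id)
    );
""")


def source_id(conn, name):
    """Return the id for a source name, creating the row on first use."""
    conn.execute(
        "INSERT OR IGNORE INTO seqfeature_source (source_name) VALUES (?)",
        (name,),
    )
    row = conn.execute(
        "SELECT seqfeature_source_id FROM seqfeature_source"
        " WHERE source_name = ?",
        (name,),
    ).fetchone()
    return row[0]


# Three features, but only two distinct sources (placeholder names).
for name in ["sourceA", "sourceA", "sourceB"]:
    conn.execute(
        "INSERT INTO seqfeature (seqfeature_source_id) VALUES (?)",
        (source_id(conn, name),),
    )

feature_count = conn.execute(
    "SELECT COUNT(*) FROM seqfeature").fetchone()[0]
source_count = conn.execute(
    "SELECT COUNT(*) FROM seqfeature_source").fetchone()[0]
```

Each source string is stored once, so any extra columns later added to seqfeature_source would not need to be repeated per feature.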

Changing this isn't necessarily /wrong/, but it does feel
like a (small) step backwards to me, especially since
the other baseline feature property (from a BioJava perspective),
the `type', is normalized (originally in seqfeature_key, now moved
to ontology_term).

> The problem is not confined to seqfeature_sources. Think of 
> gene_name annotations for instance. Gene_name goes as ontology_term, 
> but the interesting stuff ends up as a qualifier value in the 
> bioentry/ontology_term association table. Not only is the value a 
> LOB which is not indexable straightforwardly, it also will occur 
> multiple times if it is associated with more than one bioentry 
> (which it in many cases will), and hence obtaining a non-redundant 
> list of gene names is non-trivial. The present solution may look 
> simple, but it's a bad solution. Gene names should go into the 
> ontology_term table instead.
> 
> If seqfeature_source should sit in its own table, so should 
> gene_name. And over time, we'll encounter other things that should 
> as well.

I quite agree with this, except /please/ don't call it gene_name.
I think there are some fairly good arguments for having a
seqfeature_name (or similar) table.  Of course, adding this raises
other issues.  It seems to be a many-to-many relationship.  There are
also namespacing issues (which might be solved by strongly encouraging
the use of LSIDs).
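
One possible shape for such a table, sketched in Python with sqlite3. Everything here is an assumption made for illustration: the association table seqfeature_name_assoc, the namespace column, and the placeholder names and namespaces are not part of any agreed BioSQL design.

```python
import sqlite3

# Hypothetical layout: names live in their own table, qualified by a
# namespace, and are linked to features through an association table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE seqfeature (
        seqfeature_id INTEGER PRIMARY KEY
    );
    CREATE TABLE seqfeature_name (
        seqfeature_name_id INTEGER PRIMARY KEY,
        namespace TEXT NOT NULL,
        name TEXT NOT NULL,
        UNIQUE (namespace, name)
    );
    CREATE TABLE seqfeature_name_assoc (
        seqfeature_id INTEGER NOT NULL
            REFERENCES seqfeature (seqfeature_id),
        seqfeature_name_id INTEGER NOT NULL
            REFERENCES seqfeature_name (seqfeature_name_id),
        PRIMARY KEY (seqfeature_id, seqfeature_name_id)
    );
""")

conn.executemany(
    "INSERT INTO seqfeature (seqfeature_id) VALUES (?)",
    [(1,), (2,), (3,)],
)
conn.executemany(
    "INSERT INTO seqfeature_name (seqfeature_name_id, namespace, name)"
    " VALUES (?, ?, ?)",
    [(1, "ns1", "geneA"), (2, "ns2", "geneA-alias")],
)
# Many-to-many: name 1 is shared by three features, feature 2 has two names.
conn.executemany(
    "INSERT INTO seqfeature_name_assoc VALUES (?, ?)",
    [(1, 1), (2, 1), (2, 2), (3, 1)],
)

# A non-redundant list of names is a plain SELECT on one table.
distinct_names = [
    row[0]
    for row in conn.execute("SELECT name FROM seqfeature_name ORDER BY name")
]

# All names attached to feature 2, via the association table.
names_for_feature_2 = [
    row[0]
    for row in conn.execute(
        "SELECT n.name FROM seqfeature_name n"
        " JOIN seqfeature_name_assoc a"
        " ON a.seqfeature_name_id = n.seqfeature_name_id"
        " WHERE a.seqfeature_id = ? ORDER BY n.name",
        (2,),
    )
]
```

The UNIQUE (namespace, name) constraint is where a namespacing convention such as LSIDs could plug in; the sketch only uses opaque placeholder strings.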

<change_of_subject />

Thinking a bit more generally about your changes to BioSQL, and issues
you discussed at BOSC, I've noticed some overlap with the ways we're
talking about handling annotated sequence in BioJava2.  The basic plan
is to separate features (which might be genes, or other objects) from
their mappings onto sequences.  All the type information, and most (all)
of the key-value stuff (which will hopefully be more strongly constrained
by the type system) goes onto the FeatureCard, while the FeatureMapping
stays very simple.  It allows you to build a system which gives
equal weight to `gene-centric' and `sequence-centric' views of
your annotation (unlike BioJava1, which turned out to be very strongly
sequence-centric).
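
The separation described above might look roughly like the following Python sketch. Only the names FeatureCard and FeatureMapping come from the discussion; the fields (card_id, feature_type, properties, sequence_id, start, end, strand) and both lookup functions are hypothetical and do not reflect the actual BioJava2 design.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureCard:
    """What a feature is: its type plus key-value annotation."""
    card_id: str
    feature_type: str
    properties: dict = field(default_factory=dict)


@dataclass(frozen=True)
class FeatureMapping:
    """Where a feature sits: one placement of a card on one sequence."""
    card_id: str
    sequence_id: str
    start: int
    end: int
    strand: int


# Placeholder data: one card mapped onto two different sequences.
cards = {
    "card1": FeatureCard("card1", "gene", {"note": "placeholder"}),
}
mappings = [
    FeatureMapping("card1", "seqA", 100, 900, 1),
    FeatureMapping("card1", "seqB", 5100, 5900, -1),
]


def mappings_for_card(card_id):
    """Feature-centric view: every placement of one card."""
    return [m for m in mappings if m.card_id == card_id]


def cards_on_sequence(sequence_id):
    """Sequence-centric view: every card placed on one sequence."""
    return [cards[m.card_id] for m in mappings
            if m.sequence_id == sequence_id]
```

Because annotation lives on the card and location lives on the mapping, neither view is privileged: both are simple filters over the same two collections.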

I don't know if there's any enthusiasm at all for building this
kind of pattern into the next generation of BioSQL.  But you
might be interested to look over the FeatureCard/FeatureMapping
discussions on the biojava-dev list.  At some point in the
(hopefully fairly near) future, I'll write a summary of this.

    Thomas