[Open-bio-l] BioSQL schema: some questions

Ewan Birney birney@ebi.ac.uk
Sat, 27 Apr 2002 12:07:44 +0100 (BST)


On Fri, 26 Apr 2002, Chris Mungall wrote:

> > 8) Biosequence has a seq_version. How is that different from
> > Bioentry.entry_version?
> 
> pass

Sequence versions and entry versions have different semantics: the sequence
version is the important one and changes when the sequence changes. The entry
version changes on sequence changes and on any other (e.g.,
annotation) changes.

Standard EMBL/GenBank stuff.
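
A minimal sketch of that rule, in case it helps. The Entry class and the
update() helper are made up for illustration and are not part of BioSQL or
bioperl; only the two version counters mirror the schema attributes.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """Hypothetical in-memory stand-in for a Bioentry plus its Biosequence."""
    sequence: str
    annotation: dict = field(default_factory=dict)
    seq_version: int = 1
    entry_version: int = 1


def update(entry, sequence=None, annotation=None):
    """Apply a change and bump the version counters accordingly."""
    seq_changed = sequence is not None and sequence != entry.sequence
    ann_changed = annotation is not None and annotation != entry.annotation
    if seq_changed:
        entry.sequence = sequence
        entry.seq_version += 1      # only a sequence change bumps this
    if ann_changed:
        entry.annotation = dict(annotation)
    if seq_changed or ann_changed:
        entry.entry_version += 1    # any change bumps this
    return entry
```

An annotation-only update leaves seq_version alone, so anything keyed on
accession plus sequence version stays valid across re-annotation.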

> 
> > 9) Why is there molecule in Biosequence (and not in Bioentry)? I.e.,
> > would there be Biosequence entries of different molecule (mRNA, DNA,
> > ...) for a particular Bioentry? If so, this is contradicted by the
> > identifying relationship to Bioentry (there is a UK on the FK).
> 
> pass; there was a thread on bioperl a while ago about molecule type vs
> alphabet
> 

molecule = where it came from (e.g. mRNA)

alphabet = how it is encoded (DNA/RNA etc.)
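
As a rough illustration of why the two are separate: the alphabet can be read
off the letters, the molecule cannot. The guess_alphabet() function and the
record dict below are hypothetical, and the letter sets are a simplification
that ignores most IUPAC ambiguity codes.

```python
def guess_alphabet(seq):
    """Guess the encoding alphabet from the letters alone (simplified)."""
    letters = set(seq.upper())
    if letters <= set("ACGTN"):
        return "dna"
    if letters <= set("ACGUN"):
        return "rna"
    return "protein"


# Hypothetical record: an mRNA-derived entry written in DNA letters.
record = {"molecule": "mRNA", "seq": "ATGGCCAAGT"}
record["alphabet"] = guess_alphabet(record["seq"])
```

Here the molecule is mRNA while the alphabet comes out as "dna", which is the
case that makes it awkward to fold the two into one attribute.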

> > 10) In Seqfeature all attributes except primary key and FK to Bioentry
> > are nullable. This makes it hard to guarantee a way to uniquely identify
> > a record (other than by PK, which may change from db-load to db-load).
> 
> Looking at it from a genbank loading point of view, this makes sense.
> 
> If you want features to persist then they should have their own bioentry.
> 
> Individual projects may wish to have their own decisions about ways to
> uniquely identify seqfeatures (dbxrefs, 'name' qualifiers) but we can't
> enforce this at the relational level without breaking genbank mode
> 
> > 11) Similar for Seqfeature_location: since start and end are nullable,
> > what would be the UK other than the PK? Maybe seqfeature_id and
> > location_rank?
> 
> I'll add
> UNIQUE (seqfeature_id, location_rank),
> 
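
A quick way to check what that constraint buys, sketched with Python's sqlite3
module rather than mysql/pg. The table is cut down to the columns under
discussion; the column names seq_start/seq_end and the NOT NULL on
location_rank are assumptions for the example, not the actual schema.

```python
import sqlite3


def demo_unique_rank():
    """Return (duplicate_rejected, row_count) for a toy seqfeature_location."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE seqfeature_location (
            seqfeature_location_id INTEGER PRIMARY KEY,
            seqfeature_id INTEGER NOT NULL,
            seq_start INTEGER,           -- nullable, as in the question
            seq_end INTEGER,             -- nullable, as in the question
            location_rank INTEGER NOT NULL,
            UNIQUE (seqfeature_id, location_rank)
        )
        """
    )
    insert = (
        "INSERT INTO seqfeature_location "
        "(seqfeature_id, seq_start, seq_end, location_rank) VALUES (?, ?, ?, ?)"
    )
    conn.execute(insert, (1, None, None, 1))   # start/end unknown is still fine
    conn.execute(insert, (1, 100, 200, 2))
    try:
        conn.execute(insert, (1, 300, 400, 2))  # same feature, same rank
        duplicate_rejected = False
    except sqlite3.IntegrityError:
        duplicate_rejected = True
    count = conn.execute("SELECT COUNT(*) FROM seqfeature_location").fetchone()[0]
    conn.close()
    return duplicate_rejected, count
```

The point is that rows stay identifiable by (seqfeature_id, location_rank) even
when start and end are both NULL.
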
> > 12) Seqfeature_relationship has a PK attribute, but is never referenced.
> > Will someone want to reference it by PK?
> 
> Quite possibly.
> 
> I had envisioned the semantics of seqfeature_relationship being left open.
> Mostly it will be used to specify compositional relationships.
> 
> You could use it in combination with an ontology to specify other kinds of
> relationships (e.g. P-insertion X disrupts gene Y); in some of these
> cases, you may want to record extra information about the relationship
> (e.g. who made the association and when)
> 
> > 13) Same for the association table between Dbxref and Ontology_Term
> > (Dbxref_Qualifier_Value).
> 
> dbxref_qualifier_value_id isn't really useful as far as I can see
> 
> > 14) Same for the association table between Dbxref and Bioentry
> > (Bioentry_Direct_Links; the table name should actually be singular for
> > consistency).
> 
> yep
> 
> > 15) There is no hierarchy or relationship between ontology terms.
> > Intentional?
> 
> this is here - as a separate component, under sql/ontology/
> 
> right now the db is built from the components via a makefile, which also
> takes care of mysql/pg conversion.
> 
> I think it may be a good idea to further break down the schema into
> components; I don't know if makefiles are the best long term solution for
> specifying how to combine the components.
> 
> is there a standard way of specifying this, or shall we make up our own?
> 
> Sounds like a good excuse for a tab vs xml war....
> 
> > 16) Why is seqfeature_source_id nullable in Seqfeature?
> 
> pass
> 
Probably an oversight.

> > 17) Aren't Dbxref.dbname and Biodatabase.name redundant? Shouldn't there
> > be a FK?
> 
> pass
> 

Not all dbxref.dbname values will be a biodatabase.name, in particular in
things like Swiss-Prot, which is going link-tastic very quickly.
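
To make the consequence concrete, here is a small sketch of what a FK would do,
using Python's sqlite3 module. The two tables are stripped down to one or two
columns and the database names and accession are invented for the example; it
is not the real schema.

```python
import sqlite3


def try_insert_dbxref(dbname):
    """Return True if a dbxref to `dbname` is accepted under a FK to biodatabase."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FKs off by default
    conn.execute("CREATE TABLE biodatabase (name TEXT PRIMARY KEY)")
    conn.execute(
        """
        CREATE TABLE dbxref (
            dbname TEXT NOT NULL REFERENCES biodatabase(name),
            accession TEXT NOT NULL
        )
        """
    )
    # Only one database has actually been loaded locally.
    conn.execute("INSERT INTO biodatabase (name) VALUES ('embl')")
    try:
        conn.execute(
            "INSERT INTO dbxref (dbname, accession) VALUES (?, ?)",
            (dbname, "X00001"),
        )
        return True
    except sqlite3.IntegrityError:
        return False
    finally:
        conn.close()
```

With the FK in place, a cross-reference to any database that has not been
loaded as a biodatabase is rejected, so every linked-to resource would need a
biodatabase row first.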

> > I'm wondering how I would 'correctly' represent a mapping of, e.g.,
> > Celera transcripts (Bioentries?) onto the Ensembl assembly.
> 
> We need a similarity-pair table for this - shall I make one?
> 
> How should we deal with scores, e-vals etc? Using a qualifier-value system
> is generic and can be extended for a variety of programs and metrics. But
> then we lose the ability to use floating point arithmetic at the DBMS
> level.
> 
> We could have tables:
> 
> featurepair_qvalue_float
> 
> featurepair_qvalue_int
> 
> featurepair_qvalue_text
> 
> but this seems a bit ugly.
> 
> In gadfly the featurepair table has the common qualifiers (score, e-val,
> qframe, sframe), and a qualifier-value system is used for the less common
> ones. I think this is a good solution; it breaks the generic biosql model
> but querying by e-val is so useful and common I think it's OK
> 
> Or do we allow different implementations here?
> 
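
For comparison, a rough sketch of the two options side by side, again with
Python's sqlite3 module. The table layouts (a featurepair table with a typed
evalue column, and a single text-valued featurepair_qvalue table) are
assumptions made up for the example and are not the gadfly or BioSQL
definitions.

```python
import sqlite3


def hits_below(threshold):
    """Return ids with e-value below threshold, via typed column and via qvalue."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE featurepair "
        "(featurepair_id INTEGER PRIMARY KEY, score REAL, evalue REAL)"
    )
    conn.execute(
        "CREATE TABLE featurepair_qvalue "
        "(featurepair_id INTEGER, qualifier TEXT, value TEXT)"
    )
    conn.executemany(
        "INSERT INTO featurepair VALUES (?, ?, ?)",
        [(1, 250.0, 2e-20), (2, 40.0, 0.5)],
    )
    conn.executemany(
        "INSERT INTO featurepair_qvalue VALUES (?, ?, ?)",
        [(1, "evalue", "2e-20"), (2, "evalue", "0.5")],
    )
    # Typed column: the comparison is numeric as written.
    typed = [
        row[0]
        for row in conn.execute(
            "SELECT featurepair_id FROM featurepair "
            "WHERE evalue < ? ORDER BY featurepair_id",
            (threshold,),
        )
    ]
    # Generic qualifier-value: the text has to be cast on every query.
    generic = [
        row[0]
        for row in conn.execute(
            "SELECT featurepair_id FROM featurepair_qvalue "
            "WHERE qualifier = 'evalue' AND CAST(value AS REAL) < ? "
            "ORDER BY featurepair_id",
            (threshold,),
        )
    ]
    conn.close()
    return typed, generic
```

Both queries can be made to agree, but the generic one relies on a per-row cast
and on every stored value being parseable as a number, which is the cost being
weighed against keeping the model generic.
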
> > 	-hilmar
> >
> 

-----------------------------------------------------------------
Ewan Birney. Mobile: +44 (0)7970 151230, Work: +44 1223 494420
<birney@ebi.ac.uk>. 
-----------------------------------------------------------------