[Open-bio-l] BioSQL schema: some questions

Chris Mungall cjm@bdgp.lbl.gov
Fri, 26 Apr 2002 15:40:23 -0700 (PDT)


[I've moved this thread back onto gmod-devel]

On Thu, 25 Apr 2002, Hilmar Lapp wrote:

> I finally managed to create an ERD model reverse-engineered from the DDL
> (after some de-MySQLisation :) After studying the schema I've got the
> following questions; some (or all :) may seem dumb as I haven't followed
> the evolution process, so I may miss that lots of the things I noticed
> were discussed previously and are there for a reason. Also, I realize
> that one of the prime motivations was to enable public databank
> round-tripping, which is not at all my use case; so some of the
> following things may simply be due to this different angle.
>
> Anyway, I'd appreciate any feedback and/or clarification. Here goes:
>
> 1) There are several attributes of type MEDIUMTEXT. Was it chosen
> because you thought those attributes (e.g., full_lineage,
> reference_location, reference_title, term_definition, etc.) really
> become so large (up to 2^24-1 chars)?

I think for all of these we want to avoid a 255 char limit, a smalltext
would probably do. Is there much of a noticeable speed gain between
smalltext/mediumtext?

> 2) Table TAXA appears to be unconnected. Is TAXA_ID in BIOENTRY_TAXA
> supposed to be a foreign key to it?

it should, I fixed this.

> 3) If taxa_id in Bioentry_Taxa is FK to Taxa: since the relationship to
> Bioentry is identifying and there are no other attributes, it seems the
> table is superfluous and could be replaced by a nullable FK in Bioentry
> to Taxa. Were more attributes anticipated for Bioentry_Taxa?

good point.

it should probably remain as it is, it's not doing any harm, it's possible
you get a slight performance gain for certain queries,

> 4) I'd add organization and download-URI to Biodatabase (there are only
> few entries anyway, and it may be useful at some places, like us :), for
> record-keeping).

I'd go for a biodatabase_qualifier_value table here

> 5) Why does the UK in Bioentry include entry_version and division, too?
> I would think that Biodatabase_id and accession should be unique
> already. (?) (Maybe version does need to be added if one is interested
> to keep old versions. Division too?)

I think the version is required, but I don't know about division

> 6) Comment has a separate primary key attribute, but is never
> referenced. Is it anticipated that it will be referenced in the future?
> (i.e., will ever anyone want to reference a comment by PK?)

I don't think there's any real need for this PK.

I've just added  UNIQUE(bioentry_id, comment_rank)

I think the ability to link comments to seqfeatures would be useful, to
do this you'd have to use the seqfeature_qualifier_value table rather
than the comment table.

> 7) Same for Biosequence. (In fact, in local scenarios you do want to be
> able to reference those probably in order to integrate other data that
> have sequences.)

yes, i think it is useful having a seperate PK attribute here

> 8) Biosequence has a seq_version. How is that different from
> Bioentry.entry_version?

pass

> 9) Why is there molecule in Biosequence (and not in Bioentry)? I.e.,
> would there be Biosequence entries of different molecule (mRNA, DNA,
> ...) for a particular Bioentry? If so, this is contradicted by the
> identifying relationship to Bioentry (there is a UK on the FK).

pass; there was a thread on bioperl a while ago about molecule type vs
alphabet

> 10) In Seqfeature all attributes except primary key and FK to Bioentry
> are nullable. This makes it hard to guarantee a way to uniquely identify
> a record (other than by PK, which may change from db-load to db-load).

Looking at it from a genbank loading point of view, this makes sense.

If you want features to persist then they should have their own bioentry.

Individual projects may wish to have their own decisions about ways to
uniquely identify seqfeatures (dbxrefs, 'name' qualifiers) but we can't
enforce this at the relational level without breaking genbank mode

> 11) Similar for Seqfeature_location: since start and end are nullable,
> what would be the UK other than the PK? Maybe seqfeature_id and
> location_rank?

I'll add
UNIQUE (seqfeature_id, location_rank),

> 12) Seqfeature_relationship has a PK attribute, but is never referenced.
> Will someone want to reference it by PK?

Quite possibly.

I had envisioned the semantics of seqfeature_relationship being left open.
Mostly it will be used to specify compositional relationships.

You could use it in combination an ontology to specify other kinds of
relationships (e.g. P-insertion X disrupts gene Y); in some of these
cases, you may want to record extra information about the relationship
(e.g. who made the association and when)

> 13) Same for the association table between Dbxref and Ontology_Term
> (Dbxref_Qualifier_Value).

dbxref_qualifier_value_id isn't really useful as far as I can see

> 14) Same for the association table between Dbxref and Bioentry
> (Bioentry_Direct_Links; the table name should actually be singular for
> consistency).

yep

> 15) There is no hierarchy or relationship between ontology terms.
> Intentional?

this is here - as a seperate component, under sql/ontology/

right now the db is built from the components via a makefile, which also
takes care of mysql/pg conversion.

I think it may be a good idea to further break down the schema into
components; I don't know if makefiles are the best long term solution for
specifying how to combine the components.

is there a standard way of specifying this, or shall we make up our own.

Sounds like a good excuse for a tab vs xml war....

> 16) Why is seqfeature_source_id nullable in Seqfeature?

pass

> 17) Aren't Dbxref.dbname and Biodatabase.name redundant? Shouldn't there
> be a FK?

pass

> I'm wondering how I would 'correctly' represent a mapping of, e.g.,
> Celera transcripts (Bioentries?) onto the Ensembl assembly.

We need a similarity-pair table for this - shall I make one?

How should we deal with scores, e-vals etc? Using a qualifier-value system
is generic and can be extended for a variety of programs and metrics. But
then we lose the ability to use floating point arithmetic at the DBMS
level.

We could have tables:

featurepair_qvalue_float

featurepair_qvalue_int

featurepair_qvalue_text

but this seems a bit ugly.

In gadfly the featurepair table has the common qualifiers (score, e-val,
qframe, sframe), and a qualifier-value system is used for the less common
ones. I think this is a good solution; it breaks the generic biosql model
but querying by e-val is so useful and common I think it's OK

Or do we allow different implementations here?

> 	-hilmar
>