[Open-bio-l] BioSQL schema: some questions

Hilmar Lapp hlapp@gnf.org
Thu, 25 Apr 2002 21:06:51 -0700


I finally managed to create an ERD model reverse-engineered from the DDL (after some de-MySQLisation :) After studying the schema I've got the following questions; some (or all :) may seem dumb as I haven't followed the evolution process, so I may miss that lots of the things I noticed were discussed previously and are there for a reason. Also, I realize that one of the prime motivations was to enable public databank round-tripping, which is not at all my use case; so some of the following things may simply be due to this different angle.

Anyway, I'd appreciate any feedback and/or clarification. Here goes:

1) There are several attributes of type MEDIUMTEXT. Was it chosen because you thought those attributes (e.g., full_lineage, reference_location, reference_title, term_definition, etc.) really become so large (up to 2^24-1 chars)?

2) Table TAXA appears to be unconnected. Is TAXA_ID in BIOENTRY_TAXA supposed to be a foreign key to it?

3) If taxa_id in Bioentry_Taxa is FK to Taxa: since the relationship to Bioentry is identifying and there are no other attributes, it seems the table is superfluous and could be replaced by a nullable FK in Bioentry to Taxa. Were more attributes anticipated for Bioentry_Taxa?

4) I'd add organization and download-URI to Biodatabase (there are only few entries anyway, and it may be useful at some places, like us :), for record-keeping).

5) Why does the UK in Bioentry include entry_version and division, too? I would think that Biodatabase_id and accession should be unique already. (?) (Maybe version does need to be added if one is interested to keep old versions. Division too?)

6) Comment has a separate primary key attribute, but is never referenced. Is it anticipated that it will be referenced in the future? (i.e., will ever anyone want to reference a comment by PK?)

7) Same for Biosequence. (In fact, in local scenarios you do want to be able to reference those probably in order to integrate other data that have sequences.)

8) Biosequence has a seq_version. How is that different from Bioentry.entry_version?

9) Why is there molecule in Biosequence (and not in Bioentry)? I.e., would there be Biosequence entries of different molecule (mRNA, DNA, ...) for a particular Bioentry? If so, this is contradicted by the identifying relationship to Bioentry (there is a UK on the FK).

10) In Seqfeature all attributes except primary key and FK to Bioentry are nullable. This makes it hard to guarantee a way to uniquely identify a record (other than by PK, which may change from db-load to db-load).

11) Similar for Seqfeature_location: since start and end are nullable, what would be the UK other than the PK? Maybe seqfeature_id and location_rank?

12) Seqfeature_relationship has a PK attribute, but is never referenced. Will someone want to reference it by PK?

13) Same for the association table between Dbxref and Ontology_Term (Dbxref_Qualifier_Value).

14) Same for the association table between Dbxref and Bioentry (Bioentry_Direct_Links; the table name should actually be singular for consistency). 

15) There is no hierarchy or relationship between ontology terms. Intentional?

16) Why is seqfeature_source_id nullable in Seqfeature?

17) Aren't Dbxref.dbname and Biodatabase.name redundant? Shouldn't there be a FK?

I'm wondering how I would 'correctly' represent a mapping of, e.g., Celera transcripts (Bioentries?) onto the Ensembl assembly.

	-hilmar
-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp@gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------