[Open-bio-l] BioSQL schema: some questions

Fri, 26 Apr 2002 18:22:10 -0700

> -----Original Message-----
> From: Chris Mungall [mailto:cjm@bdgp.lbl.gov]
> Sent: Friday, April 26, 2002 3:40 PM
> To: Hilmar Lapp
> Cc: OBDA BioSQL (E-mail); gmod-devel@lists.sourceforge.net
> Subject: Re: [Open-bio-l] BioSQL schema: some questions
> 
> 
> 
> [I've moved this thread back onto gmod-devel]
> 
> On Thu, 25 Apr 2002, Hilmar Lapp wrote:
> 
> >
> > 1) There are several attributes of type MEDIUMTEXT. Was it chosen
> > because you thought those attributes (e.g., full_lineage,
> > reference_location, reference_title, term_definition, etc.) really
> > become so large (up to 2^24-1 chars)?
> 
> I think for all of these we want to avoid a 255 char limit, a 
> smalltext
> would probably do. Is there much of a noticeable speed gain between
> smalltext/mediumtext?

I don't know for MySQL. In Oracle you don't want to use a LOB type unless you have to (they are not supported for every operation 'normal' types are, you have to go through functions for replace/substr/length etc, and DBI and JDBC often require you to stream those instead of plain getString(field_num); btw isn't that quite similar for the MySQL drivers?), and VARCHAR2 goes up to 4000 chars. So, I was asking what max length you really anticipated beyond 255, so whether or not I could still convert these to VARCHAR.

> 
> > 2) Table TAXA appears to be unconnected. Is TAXA_ID in BIOENTRY_TAXA
> > supposed to be a foreign key to it?
> 
> it should, I fixed this.
> 
> > 3) If taxa_id in Bioentry_Taxa is FK to Taxa: since the 
> relationship to
> > Bioentry is identifying and there are no other attributes, 
> it seems the
> > table is superfluous and could be replaced by a nullable FK 
> in Bioentry
> > to Taxa. Were more attributes anticipated for Bioentry_Taxa?
> 
> good point.
> 
> it should probably remain as it is, it's not doing any harm, 
> it's possible
> you get a slight performance gain for certain queries,

I just personally like to remove things which aren't needed; I don't think it would give you a noticeable performance gain.

> 
> > 4) I'd add organization and download-URI to Biodatabase 
> (there are only
> > few entries anyway, and it may be useful at some places, 
> like us :), for
> > record-keeping).
> 
> I'd go for a biodatabase_qualifier_value table here

So, rather the Extendable pattern (can one call a pattern which is not a GoF pattern :)

[...]
> 
> > 9) Why is there molecule in Biosequence (and not in Bioentry)? I.e.,
> > would there be Biosequence entries of different molecule (mRNA, DNA,
> > ...) for a particular Bioentry? If so, this is contradicted by the
> > identifying relationship to Bioentry (there is a UK on the FK).
> 
> pass; there was a thread on bioperl a while ago about molecule type vs
> alphabet

I actually liked the change to 'alphabet'. Especially if it means that and not molecule type. I think that was the conflict in Bio::PrimarySeq. (Jason, are you on this list? Is this true?)

> 
> > 10) In Seqfeature all attributes except primary key and FK 
> to Bioentry
> > are nullable. This makes it hard to guarantee a way to 
> uniquely identify
> > a record (other than by PK, which may change from db-load 
> to db-load).
> 
> Looking at it from a genbank loading point of view, this makes sense.
> 
> If you want features to persist then they should have their 
> own bioentry.

So you can decide which 'features' become bioentries in their own right. Would there be a guideline or so, as this could get potentially confusing?

[...]
> 
> > I'm wondering how I would 'correctly' represent a mapping of, e.g.,
> > Celera transcripts (Bioentries?) onto the Ensembl assembly.
> 
> We need a similarity-pair table for this - shall I make one?

I'd say yes. Would that link bioentries, seqfeatures, or both, or both as a cross-product? (The latter seems to be right to me -- a similarity between any of two of {bioentries, seqfeatures}.)

> 
> How should we deal with scores, e-vals etc? Using a 
> qualifier-value system
> is generic and can be extended for a variety of programs and 
> metrics. But
> then we lose the ability to use floating point arithmetic at the DBMS
> level.
> 
> We could have tables:
> 
> featurepair_qvalue_float
> 
> featurepair_qvalue_int
> 
> featurepair_qvalue_text
> 
> but this seems a bit ugly.
> 
> In gadfly the featurepair table has the common qualifiers 
> (score, e-val,
> qframe, sframe), and a qualifier-value system is used for the 
> less common
> ones. I think this is a good solution; it breaks the generic 
> biosql model
> but querying by e-val is so useful and common I think it's OK

A relational schema for MAGE data (gene expression) faces similar problems (various types of quantitation, which are at the same time different from one technology to another and constitutes the by far most voluminous table). One of the suggestions, which is employed successfully at a couple places, was to go for a mixed solution with both a set of the most common values as fixed attributes and an additional generic 'overflow' table that allows for any number of additional values, but at a price. Even though it breaks the genericity rule, it makes a lot of sense to me and I kind-of liked it.

	-hilmar
-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp@gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------