[BioSQL-l] Seqfeature_Source
Hilmar Lapp
hlapp@gnf.org
Wed, 18 Sep 2002 11:53:39 -0700
(I hope I'm not offending you by bringing this back to the list -- I
guess you just forgot to reply-all)
On Wednesday, September 18, 2002, at 03:08 AM, Thomas Down wrote:
> On Thu, Sep 12, 2002 at 12:52:56PM -0700, Hilmar Lapp wrote:
>> I'm not sure what was the idea behind this single-column table.
>> Unless I'm missing something, I propose to collapse this table into
>> a column on Seqfeature for simplicity reasons (makes both the schema
>> and adaptor code simpler, and can only improve performance).
>
> I know this may be a bit late to reply -- I'm afraid your original
> message got lost in the open-bio network outage.
>
> Could I put in an argument against this change. The original
> BioSQL design was `normalized everything', and I think it would
> be a shame to move away from this.
Theoretically I agree, but in practice one thing I learnt over the
years repeatedly is that fully-normalized sounds good, but sometimes
doesn't work well. I'm saying this here generally speaking; I'm
usually a normalization-advocate.
What I think needs to be normalized away is a) what constitutes
separate entities, and b) to ease enforcing controlled vocabularies.
Seqfeature_Source certainly doesn't fall into the first category
(it's not an object in Bioperl either, nor in any other Bio* project
AFAIK). As you're correctly pointing out below, it may fall into the
second though.
> In particular, if you make
> the seqfeature.source column a char() or varchar(), you're left
> with questions like what character limit to use (255 seems to be
> the BioSQL standard, IIRC).
I think based on my experience this should almost never be a
consideration regarding whether or not to separate out columns into
artificial entities. First, because the entity then is purely
artificial, and second because we should leave it to RDBMS vendors
to optimize handling of columns that may be empty many times, or
have varying lengths.
In fact, even MySQL handles VARCHAR() types quite efficiently, at
least according to the docs: it only occupies l+1 bytes, l being the
actual length of the value stored, not the maximum length. Other
RDBMSs I know do at least that.
>
> I'm also not 100% convinced that this /will/ make things faster.
> RDBMS in general really like looking things up by indexed integer
> columns. Might be something worth benchmarking.
>
The join doesn't absolve you from looking up the name if you want to
look it up, right? Conversely, if you don't, joining doesn't make it
faster either, right?
> Couple of note:
>
> - I'm not hard nosed about this, and am open to persuasion
> (especially if it really does make things run a lot faster).
>
My point is not only about this being potentially faster. My point
was more targeted at the present design being non-intuitive, and not
justified by any of the Bio* object models AFAIK.
> - Something which I proposed at Cape Town, but was dropped at
> the time, was to re-use the ontology_term mechanism for
> sources, since
>
> a) ontology_term is used in all the other places BioSQL
> wants controlled vocabulary stuff.
>
That's exactly what I thought converting it to. If not an attribute,
then I think this is what it should be.
The only disadvantage implementation-wise is that then you have two
FKs from seqfeature to ontology_term, hence you have to name them
differently and auto-determining their name is not as standard
anymore. I'd like to get the whole thing working before that
change ...
Another possibility is to turn the ontology_term - seqfeature FK
into an association.
-hilmar
> b) in the future, I can see people who run big analysis
> pipelines with large numbers of analyses creating
> ontologies
> to define all the different source fields in their
> database.
>
>
>
> Thomas.
>
--
-------------------------------------------------------------
Hilmar Lapp email: lapp at gnf.org
GNF, San Diego, Ca. 92121 phone: +1-858-812-1757
-------------------------------------------------------------