[BioSQL-l] Seqfeature_Source

Hilmar Lapp hlapp@gnf.org
Wed, 18 Sep 2002 11:53:39 -0700


(I hope I'm not offending you by bringing this back to the list -- I 
guess you just forgot to reply-all)

On Wednesday, September 18, 2002, at 03:08 AM, Thomas Down wrote:

> On Thu, Sep 12, 2002 at 12:52:56PM -0700, Hilmar Lapp wrote:
>> I'm not sure what was the idea behind this single-column table.
>> Unless I'm missing something, I propose to collapse this table into
>> a column on Seqfeature for simplicity reasons (makes both the schema
>> and adaptor code simpler, and can only improve performance).
>
> I know this may be a bit late to reply -- I'm afraid your original
> message got lost in the open-bio network outage.
>
> Could I put in an argument against this change.  The original
> BioSQL design was `normalized everything', and I think it would
> be a shame to move away from this.

Theoretically I agree, but in practice one thing I learnt over the 
years repeatedly is that fully-normalized sounds good, but sometimes 
doesn't work well. I'm saying this here generally speaking; I'm 
usually a normalization-advocate.

What I think needs to be normalized away is a) what constitutes 
separate entities, and b) to ease enforcing controlled vocabularies.

Seqfeature_Source certainly doesn't fall into the first category 
(it's not an object in Bioperl either, nor in any other Bio* project 
AFAIK). As you're correctly pointing out below, it may fall into the 
second though.

> In particular, if you make
> the seqfeature.source column a char() or varchar(), you're left
> with questions like what character limit to use (255 seems to be
> the BioSQL standard, IIRC).

I think based on my experience this should almost never be a 
consideration regarding whether or not to separate out columns into 
artificial entities. First, because the entity then is purely 
artificial, and second because we should leave it to RDBMS vendors 
to optimize handling of columns that may be empty many times, or 
have varying lengths.

In fact, even MySQL handles VARCHAR() types quite efficiently, at 
least according to the docs: it only occupies l+1 bytes, l being the 
actual length of the value stored, not the maximum length. Other 
RDBMSs I know do at least that.


>
> I'm also not 100% convinced that this /will/ make things faster.
> RDBMS in general really like looking things up by indexed integer
> columns.  Might be something worth benchmarking.
>

The join doesn't absolve you from looking up the name if you want to 
look it up, right? Conversely, if you don't, joining doesn't make it 
faster either, right?

> Couple of note:
>
>    - I'm not hard nosed about this, and am open to persuasion
>      (especially if it really does make things run a lot faster).
>

My point is not only about this being potentially faster. My point 
was more targeted at the present design being non-intuitive, and not 
justified by any of the Bio* object models AFAIK.


>    - Something which I proposed at Cape Town, but was dropped at
>      the time, was to re-use the ontology_term mechanism for
>      sources, since
>
>          a) ontology_term is used in all the other places BioSQL
>             wants controlled vocabulary stuff.
>

That's exactly what I thought converting it to. If not an attribute, 
then I think this is what it should be.

The only disadvantage implementation-wise is that then you have two 
FKs from seqfeature to ontology_term, hence you have to name them 
differently and auto-determining their name is not as standard 
anymore. I'd like to get the whole thing working before that 
change ...

Another possibility is to turn the ontology_term - seqfeature FK 
into an association.

	-hilmar

>          b) in the future, I can see people who run big analysis
>             pipelines with large numbers of analyses creating 
> ontologies
>             to define all the different source fields in their
>             database.
>
>
>
> Thomas.
>
--
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------