[BioSQL-l] location type

Hilmar Lapp hlapp@gnf.org
Tue, 15 Oct 2002 01:52:43 -0700


On Tuesday, October 15, 2002, at 01:11 AM, Thomas Down wrote:

> On Sat, Oct 12, 2002 at 04:17:59PM -0700, Hilmar Lapp wrote:
>> For certain location types, specifically fuzzy locations, the
>> additional attributes (min_start etc) are stored as
>> Location_Qualifier_Value entries.
>>
>> If you don't know in advance whether a location is a fuzzy location,
>> you need to make an extra hit to the database for every location
>> just to find out most of the time that there are no extra attributes.
>>
>> To alleviate this I propose to add to SeqFeature_Location a FK to
>> Ontology_Term denoting the type of the location. We'd need to agree
>> on a standard ontology for location types too. E.g.,
>>
>> 	FuzzyLocation
>> 	SplitLocation
>> 	ExactLocation
>
> I see duplicated information :-(.
>
> The way I handle this in BioJava is to do three queries every
> time I fetch a block of features (depending on circumstances,
> this might be all features on a bioentry, all features overlapping
> a sequence interval, or all child features of a given parent --
> all three cases go through the same Java code, with slightly
> different SQL queries):
>
>    - Fetch all interesting features, and put mementos in a Map
>      keyed on seqfeature_ids.
>
>    - Fetch all location_qualifier_values for all interesting
>      features (yes, in a single query).  Build in-memory memento
>      objects, and put in a Map keyed on location_ids.
>
>    - Fetch all location spans.  As each one is fetched, I do
>      an in-memory lookup of its qualifiers.
>
> Finally, the location spans get grouped together and attached
> to the Feature.Templates.
>
> Actually, things are a little more complex than this because
> of the feature hierarchy, but you get the general idea.
>
> I guess you could argue that I'm not taking maximum advantage
> of the database engine by doing things this way.  But it's not
> too bad to implement in practice, and scales well to large
> numbers of seqfeatures per request.
>
> Might this sort of design be a valid alternative to
> denormalization?
>

I agree you can try to optimize this in software using various
approaches. My problem with this is that it necessitates specialized
code for query construction, query processing, and object
construction, as you say. With a strict separation of schema code
from object code this is somewhat tedious, bloats the code, makes the
feature and location adaptors break the otherwise uniform pattern,
and hence increases code complexity considerably. And all this just
because there are fuzzy locations...
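
To make the comparison concrete, the three-query approach quoted above
could be sketched roughly as follows. This is Python rather than the
actual BioJava code, and the table and column names (DEMO_SCHEMA, name,
value, location_id) are simplified assumptions, not the real BioSQL DDL.

```python
import sqlite3
from collections import defaultdict

# Simplified, hypothetical tables for illustration; not the BioSQL DDL.
DEMO_SCHEMA = """
CREATE TABLE seqfeature (
    seqfeature_id INTEGER PRIMARY KEY, bioentry_id INTEGER);
CREATE TABLE seqfeature_location (
    location_id INTEGER PRIMARY KEY, seqfeature_id INTEGER,
    seq_start INTEGER, seq_end INTEGER);
CREATE TABLE location_qualifier_value (
    location_id INTEGER, name TEXT, value TEXT);
"""


def fetch_features(conn, bioentry_id):
    """Fetch a block of features with three queries, joining in memory."""
    # Query 1: feature mementos, keyed on seqfeature_id.
    features = {
        fid: {"seqfeature_id": fid, "locations": []}
        for (fid,) in conn.execute(
            "SELECT seqfeature_id FROM seqfeature WHERE bioentry_id = ?",
            (bioentry_id,))
    }

    # Query 2: every location qualifier for these features, in a single
    # query, keyed on location_id.
    qualifiers = defaultdict(dict)
    for loc_id, name, value in conn.execute(
            "SELECT q.location_id, q.name, q.value"
            " FROM location_qualifier_value q"
            " JOIN seqfeature_location l ON l.location_id = q.location_id"
            " JOIN seqfeature f ON f.seqfeature_id = l.seqfeature_id"
            " WHERE f.bioentry_id = ?", (bioentry_id,)):
        qualifiers[loc_id][name] = value

    # Query 3: location spans. Qualifiers are looked up in memory, so
    # there is no extra database hit per location.
    for loc_id, fid, start, end in conn.execute(
            "SELECT l.location_id, l.seqfeature_id, l.seq_start, l.seq_end"
            " FROM seqfeature_location l"
            " JOIN seqfeature f ON f.seqfeature_id = l.seqfeature_id"
            " WHERE f.bioentry_id = ?"
            " ORDER BY l.location_id", (bioentry_id,)):
        features[fid]["locations"].append({
            "start": start,
            "end": end,
            "qualifiers": dict(qualifiers.get(loc_id, {})),
        })
    return features
```

Even in this reduced form, the qualifier lookup has to be threaded
through the feature fetch, which is the kind of adaptor-specific code I
would like to avoid.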

What I suggested is not really denormalization. You're right though, 
there's duplication. If the coordinate types (which I mentioned 
further down in the email in the part you cut) are just moved from 
being location_qualifier_value associations to FKs on location, then 
there isn't really duplication anymore either. Or am I missing 
something?

I.e., you'd have 3 FKs on Location to Ontology_Term:

- location type
- start position type
- end position type

Location_qualifier_value would hold min_start, max_start and so forth
only if it is a fuzzy location.
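
As a rough sketch of what this could look like (Python with an
in-memory SQLite database, purely for illustration): the column names
location_type_id, start_pos_type_id and end_pos_type_id, the layout of
location_qualifier_value, and the load_location() helper are all
assumptions of mine, not agreed schema.

```python
import sqlite3

# Hypothetical, simplified DDL for illustration; not the BioSQL schema.
SCHEMA = """
CREATE TABLE ontology_term (
    ontology_term_id INTEGER PRIMARY KEY,
    term_name        TEXT NOT NULL UNIQUE
);
CREATE TABLE seqfeature_location (
    seqfeature_location_id INTEGER PRIMARY KEY,
    seq_start              INTEGER,
    seq_end                INTEGER,
    -- The three proposed FKs to ontology_term (names are assumptions):
    location_type_id  INTEGER NOT NULL
        REFERENCES ontology_term(ontology_term_id),
    start_pos_type_id INTEGER NOT NULL
        REFERENCES ontology_term(ontology_term_id),
    end_pos_type_id   INTEGER NOT NULL
        REFERENCES ontology_term(ontology_term_id)
);
CREATE TABLE location_qualifier_value (
    seqfeature_location_id INTEGER NOT NULL
        REFERENCES seqfeature_location(seqfeature_location_id),
    ontology_term_id       INTEGER NOT NULL
        REFERENCES ontology_term(ontology_term_id),
    qualifier_value        TEXT
);
"""


def load_location(conn, location_id):
    """Load one location; query qualifiers only for fuzzy locations."""
    row = conn.execute(
        "SELECT l.seq_start, l.seq_end, t.term_name"
        " FROM seqfeature_location l"
        " JOIN ontology_term t"
        "   ON t.ontology_term_id = l.location_type_id"
        " WHERE l.seqfeature_location_id = ?",
        (location_id,)).fetchone()
    if row is None:
        return None
    start, end, loc_type = row
    location = {"start": start, "end": end, "type": loc_type,
                "qualifiers": {}}
    # The location type tells us up front whether the extra hit is needed.
    if loc_type == "FuzzyLocation":
        for name, value in conn.execute(
                "SELECT t.term_name, q.qualifier_value"
                " FROM location_qualifier_value q"
                " JOIN ontology_term t"
                "   ON t.ontology_term_id = q.ontology_term_id"
                " WHERE q.seqfeature_location_id = ?",
                (location_id,)):
            location["qualifiers"][name] = value
    return location
```

The point is only that the type travels with the location row, so the
qualifier table is consulted for fuzzy locations and skipped otherwise.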

You'd need the 3 types (location, start, end) for every kind of
location anyway, wouldn't you? So, the types aren't really optional.
The advantage of having them as FKs is streamlined behaviour and
less space (in the association you need the location FK in addition)
on the one hand, and on the other you can easily speed this up by
caching ontology terms by primary key (the find_by_primary_key()
operation transparently goes to the cache if there is one; that's
one facet of my design for bioperl-db).
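
The caching idea can be sketched like this. It is a minimal Python
illustration, not bioperl-db code: the OntologyTermAdaptor class, its
db_hits counter, and the ontology_term table layout are hypothetical.

```python
import sqlite3


class OntologyTermAdaptor:
    """Hypothetical adaptor that caches ontology terms by primary key."""

    def __init__(self, conn):
        self._conn = conn
        self._cache = {}   # ontology_term_id -> term_name
        self.db_hits = 0   # counts real database lookups, for illustration

    def find_by_primary_key(self, term_id):
        # Transparently serve from the cache if the term is there.
        if term_id in self._cache:
            return self._cache[term_id]
        self.db_hits += 1
        row = self._conn.execute(
            "SELECT term_name FROM ontology_term"
            " WHERE ontology_term_id = ?", (term_id,)).fetchone()
        if row is None:
            return None    # unknown keys are not cached
        self._cache[term_id] = row[0]
        return row[0]
```

Since there are only a handful of location and position types, after
the first few lookups resolving the three FKs would not touch the
database at all.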

So, what about solving it this way?

	-hilmar

--
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------