[BioSQL-l] location type
Hilmar Lapp
hlapp@gnf.org
Tue, 15 Oct 2002 01:52:43 -0700
On Tuesday, October 15, 2002, at 01:11 AM, Thomas Down wrote:
> On Sat, Oct 12, 2002 at 04:17:59PM -0700, Hilmar Lapp wrote:
>> For certain location types, specifically fuzzy locations, the
>> additional attributes (min_start etc) are stored as
>> Location_Qualifier_Value entries.
>>
>> If you don't know in advance whether a location is a fuzzy location,
>> you need to make an extra hit to the database for every location
>> just to find out most of the time that there are no extra attributes.
>>
>> To alleviate this I propose to add to SeqFeature_Location a FK to
>> Ontology_Term denoting the type of the location. We'd need to agree
>> on a standard ontology for location types too. E.g.,
>>
>> FuzzyLocation
>> SplitLocation
>> ExactLocation
>
> I see duplicated information :-(.
>
> The way I handle this in BioJava is to do three queries every
> time I fetch a block of features (depending on circumstances,
> this might be all features on a bioentry, all features overlapping
> a sequence interval, or all child features of a given parent --
> all three cases go through the same Java code, with slightly
> different SQL queries):
>
> - Fetch all interesting features, and put mementos in a Map
> keyed on seqfeature_ids.
>
> - Fetch all location_qualifier_values for all interesting
> features (yes, in a single query). Build in-memory memento
> objects, and put in a Map keyed on location_ids.
>
> - Fetch all location spans. As each one is fetched, I do
> an in-memory lookup of its qualifiers.
>
> Finally, the location spans get grouped together and attached
> to the Feature.Templates.
>
> Actually, things are a little more complex than this because
> of the feature hierarchy, but you get the general idea.
>
> I guess you could argue that I'm not taking maximum advantage
> of the database engine by doing things this way. But it's not
> too bad to implement in practice, and scales well to large
> numbers of seqfeatures per request.
>
> Might this sort of design be a valid alternative to
> denormalization?
>
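
For reference, the three-query pattern described above might look roughly like the following. This is only a sketch in Python with an in-memory SQLite database, not the actual BioJava code. The table layout is a deliberately simplified, hypothetical stand-in for the BioSQL tables (in particular the plain "name" column on location_qualifier_value), and fetch_features() is a made-up helper name.

```python
import sqlite3
from collections import defaultdict

# Hypothetical, simplified tables for illustration; not the real BioSQL DDL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE seqfeature (
    seqfeature_id INTEGER PRIMARY KEY,
    bioentry_id   INTEGER
);
CREATE TABLE seqfeature_location (
    location_id   INTEGER PRIMARY KEY,
    seqfeature_id INTEGER REFERENCES seqfeature(seqfeature_id),
    seq_start     INTEGER,
    seq_end       INTEGER
);
CREATE TABLE location_qualifier_value (
    location_id INTEGER REFERENCES seqfeature_location(location_id),
    name        TEXT,
    value       TEXT
);
INSERT INTO seqfeature VALUES (1, 10), (2, 10);
INSERT INTO seqfeature_location VALUES
    (100, 1, 5, 50), (101, 2, 60, 90), (102, 2, 120, 150);
INSERT INTO location_qualifier_value VALUES
    (101, 'min_start', '58'), (101, 'max_start', '62');
""")


def fetch_features(conn, bioentry_id):
    """Fetch all features of one bioentry using three queries in total."""
    # Query 1: all interesting features, kept in a map keyed on seqfeature_id.
    features = {
        fid: {"seqfeature_id": fid, "locations": []}
        for (fid,) in conn.execute(
            "SELECT seqfeature_id FROM seqfeature WHERE bioentry_id = ?",
            (bioentry_id,),
        )
    }

    # Query 2: all location qualifiers for those features in a single query,
    # kept in a map keyed on location_id.
    qualifiers = defaultdict(dict)
    for loc_id, name, value in conn.execute(
        """SELECT q.location_id, q.name, q.value
           FROM location_qualifier_value q
           JOIN seqfeature_location l ON l.location_id = q.location_id
           JOIN seqfeature f ON f.seqfeature_id = l.seqfeature_id
           WHERE f.bioentry_id = ?""",
        (bioentry_id,),
    ):
        qualifiers[loc_id][name] = value

    # Query 3: all location spans; qualifiers are looked up in memory,
    # so no per-location round trip to the database is needed.
    for loc_id, fid, start, end in conn.execute(
        """SELECT l.location_id, l.seqfeature_id, l.seq_start, l.seq_end
           FROM seqfeature_location l
           JOIN seqfeature f ON f.seqfeature_id = l.seqfeature_id
           WHERE f.bioentry_id = ?
           ORDER BY l.location_id""",
        (bioentry_id,),
    ):
        features[fid]["locations"].append(
            {
                "location_id": loc_id,
                "start": start,
                "end": end,
                "qualifiers": qualifiers.get(loc_id, {}),
            }
        )
    return features
```

Calling fetch_features(conn, 10) issues three queries regardless of how many features or locations the bioentry has, which is where the scaling claim comes from.
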
I agree you can try to optimize this in software using various
approaches. My problem with this is that it necessitates specialized
code for query construction, query processing, and object construction,
as you say. With a strict separation of schema code from object code
this is somewhat tedious, bloats the code, makes the feature and
location adaptors break the otherwise uniform pattern, and hence
generally increases code complexity considerably. And all this just
because there are fuzzy locations...

What I suggested is not really denormalization. You're right though,
there's duplication. If the coordinate types (which I mentioned
further down in the email in the part you cut) are just moved from
being location_qualifier_value associations to FKs on location, then
there isn't really duplication anymore either. Or am I missing
something?

I.e., you'd have 3 FKs on Location to Ontology_Term:

- location type
- start position type
- end position type

Location_qualifier_value would optionally hold min_start, max_start
and so forth, iff it is a fuzzy location.
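
To make the proposal concrete, here is a rough sketch in Python with an in-memory SQLite database. It is an illustration under assumptions, not agreed BioSQL DDL: the column names (location_type_id, start_pos_type_id, end_pos_type_id, term_name), the position-type terms EXACT and WITHIN, and the load_location() helper are all made up for this example. Only the location type names and min_start/max_start come from this thread.

```python
import sqlite3

# Hypothetical, simplified schema: three mandatory FKs from the location
# table to ontology_term, plus optional qualifier rows for fuzzy locations.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ontology_term (
    ontology_term_id INTEGER PRIMARY KEY,
    term_name        TEXT NOT NULL UNIQUE
);
CREATE TABLE seqfeature_location (
    location_id       INTEGER PRIMARY KEY,
    seq_start         INTEGER,
    seq_end           INTEGER,
    location_type_id  INTEGER NOT NULL REFERENCES ontology_term(ontology_term_id),
    start_pos_type_id INTEGER NOT NULL REFERENCES ontology_term(ontology_term_id),
    end_pos_type_id   INTEGER NOT NULL REFERENCES ontology_term(ontology_term_id)
);
CREATE TABLE location_qualifier_value (
    location_id      INTEGER NOT NULL REFERENCES seqfeature_location(location_id),
    ontology_term_id INTEGER NOT NULL REFERENCES ontology_term(ontology_term_id),
    value            TEXT
);
INSERT INTO ontology_term VALUES
    (1, 'ExactLocation'), (2, 'FuzzyLocation'), (3, 'SplitLocation'),
    (4, 'EXACT'), (5, 'WITHIN'),          -- assumed position-type terms
    (6, 'min_start'), (7, 'max_start');
INSERT INTO seqfeature_location VALUES
    (100, 5, 50, 1, 4, 4),                -- exact location, no qualifiers
    (101, 60, 90, 2, 5, 4);               -- fuzzy start, exact end
INSERT INTO location_qualifier_value VALUES
    (101, 6, '58'), (101, 7, '62');
""")


def load_location(conn, location_id):
    """Load one location; query qualifiers only if the type says it is fuzzy."""
    row = conn.execute(
        """SELECT l.seq_start, l.seq_end, t.term_name
           FROM seqfeature_location l
           JOIN ontology_term t ON t.ontology_term_id = l.location_type_id
           WHERE l.location_id = ?""",
        (location_id,),
    ).fetchone()
    if row is None:
        return None
    start, end, loc_type = row
    loc = {"start": start, "end": end, "type": loc_type,
           "qualifiers": {}, "extra_query": False}
    if loc_type == "FuzzyLocation":
        # Only fuzzy locations pay for the second round trip.
        loc["extra_query"] = True
        for name, value in conn.execute(
            """SELECT t.term_name, q.value
               FROM location_qualifier_value q
               JOIN ontology_term t ON t.ontology_term_id = q.ontology_term_id
               WHERE q.location_id = ?""",
            (location_id,),
        ):
            loc["qualifiers"][name] = int(value)
    return loc
```

With this layout, load_location(conn, 100) returns after a single query, while load_location(conn, 101) makes the second hit because its type is FuzzyLocation. The join to ontology_term is used here only for brevity; it could be replaced by a lookup of the term by primary key.
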

You'd need the 3 types (location, start, end) for every kind of
location anyway, wouldn't you? So, the types aren't really optional.

The advantage of having them as FKs is, on the one hand, streamlined
behaviour and less space (an association row needs the location FK in
addition), and on the other hand, you can easily speed this up by
caching ontology terms by primary key (the find_by_primary_key()
operation transparently goes to the cache if there is one; that's one
facet of my design for bioperl-db).
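
The caching idea could be sketched as follows. This is Python rather than Perl and is not the bioperl-db implementation; apart from the find_by_primary_key() name mentioned above, the class name OntologyTermAdaptor, the use_cache flag, the db_hits counter, and the table layout are all assumptions made for illustration.

```python
import sqlite3

# Hypothetical minimal term table for the example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ontology_term (
    ontology_term_id INTEGER PRIMARY KEY,
    term_name        TEXT NOT NULL UNIQUE
);
INSERT INTO ontology_term VALUES
    (1, 'ExactLocation'), (2, 'FuzzyLocation'), (3, 'SplitLocation');
""")


class OntologyTermAdaptor:
    """Looks up ontology terms by primary key, optionally through a cache."""

    def __init__(self, conn, use_cache=True):
        self.conn = conn
        # None means "no cache configured"; callers never see the difference.
        self.cache = {} if use_cache else None
        self.db_hits = 0  # counts real database queries, for demonstration

    def find_by_primary_key(self, term_id):
        # Serve from the cache if there is one and it holds the term.
        if self.cache is not None and term_id in self.cache:
            return self.cache[term_id]
        self.db_hits += 1
        row = self.conn.execute(
            "SELECT term_name FROM ontology_term WHERE ontology_term_id = ?",
            (term_id,),
        ).fetchone()
        term = row[0] if row else None
        # Remember only terms that exist, so later inserts are still found.
        if self.cache is not None and term is not None:
            self.cache[term_id] = term
        return term
```

Since the set of location and position types is tiny, nearly every lookup after the first few would be served from memory, so resolving the three type FKs per location should cost very little.
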
So, what about solving it this way?
-hilmar
--
-------------------------------------------------------------
Hilmar Lapp email: lapp at gnf.org
GNF, San Diego, Ca. 92121 phone: +1-858-812-1757
-------------------------------------------------------------