[BioSQL-l] BioSQL and ontology "standards".

Thu Dec 4 16:38:05 UTC 2008

Can I just add a comment to refocus the discussion here: in part it's
probably my fault, but I think we're starting to mix two (as I see it)
distinct aspects of ontologies within the BioSQL schema:
Leighton Pritchard wrote:
> On 04/12/2008 15:04, "Peter" <biopython at maubp.freeserve.co.uk> wrote:
>>> With apologies if I'm misinterpreting the tide of discussion, but I would be
>>> disappointed to see a default behaviour of "bung everything under
>>> 'Annotation Tags', typos and all" become a 'standard' of any sort, rather
>>> than a placeholder for future development of ontology-aware Bio* code that
>>> queries and populates BioSQL appropriately.
>> Overall, I agree.  It isn't ideal, but the current ad-hoc "ontology"
>> is useful in that its looseness allows any parsable GenBank file to be
>> imported into the database.
> 
> I think that this may be a matter of perspective: you see an advantage, I
> see an accident waiting to happen ;)
I think this aspect of BioSQL ontology is strictly semantic, and can
probably only be handled when interpreting the data retrieved from a
BioSQL database in a specific context. Within a particular type of
data model, inconsistencies arise in the free text derived from
flat-file data records. These inconsistencies (typos, synonyms, case
variation, etc.) are really something to be addressed by data cleaning
prior to insertion into the database. If you don't care about putting
dirty data into a BioSQL database, then you should still be able to do
it, but you should not then expect someone else to connect to the
database and make perfect sense of the data. In particular, don't expect
a program to magically interpret your mis-annotated 'eXons' as coding
regions, for instance, or your disulphide bonds as disulfide (or vice
versa).
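
To make the cleaning step concrete, here is a minimal sketch of the kind
of thing I mean. The synonym table and the function names are made up
for illustration; they are not part of BioSQL or of any Bio* toolkit,
and a real table would need to be agreed on rather than guessed.

```python
# Hypothetical synonym table: maps known variants to one canonical tag.
# The entries are illustrative only, not an agreed BioSQL vocabulary.
CANONICAL_TAGS = {
    "exon": "exon",
    "exons": "exon",
    "disulphide bond": "disulfide bond",
    "disulfide bond": "disulfide bond",
}


def clean_tag(raw_tag):
    """Return the canonical form of a free-text tag, or None if unknown."""
    # Fold case and collapse runs of whitespace before the lookup.
    key = " ".join(raw_tag.strip().lower().split())
    return CANONICAL_TAGS.get(key)


def partition_tags(raw_tags):
    """Split tags into (cleaned, rejected) before anything is inserted."""
    cleaned, rejected = [], []
    for tag in raw_tags:
        canonical = clean_tag(tag)
        if canonical is None:
            # Unknown tags are kept aside for a human to look at,
            # rather than silently loaded as-is.
            rejected.append(tag)
        else:
            cleaned.append(canonical)
    return cleaned, rejected
```

Whether rejected tags are dropped, loaded anyway, or flagged is a policy
decision for whoever owns the database; the sketch only separates them.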

The ramification of this is that, in practice, clients that ultimately
consume and interpret any kind of BioSQL datastore have to have some
form of robustness built in, in the same way that file parsers have to
cope with the known variations of free-text feature tags in GenBank
records. In this situation one assumes that the client at least
understands that particular terms in the BioSQL annotation really are
free-text feature tags, and that brings us to the other aspect, which
in comparison is much more prosaic.

The aspect that I was talking about is 'structural' consistency (for
want of a better word) rather than semantic consistency. For instance, a
BioPerl generic feature has a 'score' attribute - this should map to the
'score' attribute on a Biopython generic feature, and also to a BioJava
score attribute. As far as I understand it (and I may be wrong here),
hierarchical relationships are faithfully preserved when a Bio* feature
is persisted in BioSQL - and I'd hope all the attributes are too. This
kind of thing is definitely worth writing down, and even making test
cases against!
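
As a rough idea of what such a test case could look like, here is a
sketch of a round-trip check. Everything in it is an assumption made for
illustration: the GenericFeature class is a stand-in for a Bio* feature
object, and the single 'feature' table is a toy, not the real BioSQL
schema or any toolkit's actual binding.

```python
import sqlite3
from dataclasses import dataclass
from typing import Optional


@dataclass
class GenericFeature:
    # Hypothetical stand-in for a Bio* generic feature object.
    primary_tag: str
    start: int
    end: int
    score: Optional[float] = None


def open_store():
    """Create an in-memory toy store (NOT the real BioSQL schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE feature ("
        " feature_id INTEGER PRIMARY KEY,"
        " primary_tag TEXT NOT NULL,"
        " start_pos INTEGER NOT NULL,"
        " end_pos INTEGER NOT NULL,"
        " score REAL)"
    )
    return conn


def persist(conn, feature):
    """Write one feature and return its row id."""
    cur = conn.execute(
        "INSERT INTO feature (primary_tag, start_pos, end_pos, score)"
        " VALUES (?, ?, ?, ?)",
        (feature.primary_tag, feature.start, feature.end, feature.score),
    )
    return cur.lastrowid


def load(conn, feature_id):
    """Read one feature back by row id."""
    row = conn.execute(
        "SELECT primary_tag, start_pos, end_pos, score"
        " FROM feature WHERE feature_id = ?",
        (feature_id,),
    ).fetchone()
    return GenericFeature(*row)


def round_trips(feature):
    """True if every attribute survives being persisted and reloaded."""
    conn = open_store()
    try:
        return load(conn, persist(conn, feature)) == feature
    finally:
        conn.close()
```

The useful version of this would persist with one toolkit and load with
another, which is exactly where a written-down mapping is needed.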

Just to duplicate what Hilmar wrote in response to the Bio* binding
comment here:
>>> ps. on a side issue - have the various Bio* language bindings actually
>>> been specified formally ?  If so - where might I find them ?
>>>
>>
>> I think the answer to that is sadly a no.
>
>
> I agree (with both the sadly and the no). Maybe I have New Years
> resolution coming at me here ...
>
> Indeed though, this needs addressing. I think I was about to start
> something on the wiki and then got sucked elsewhere. If anyone has
> energy to start this, please don't wait - wiki allows account creation
> by anyone.
I guess we (at least Peter, myself, and anyone else) should get to it
then. At least for the structural mapping between BioSQL terms and
feature structure!

As for the semantic ontology aspect - is there a way that one might tag
a BioSQL database as using external ontologies?

Jim