[BioSQL-l] BioSQL and ontology "standards".

Leighton Pritchard lpritc at scri.ac.uk
Thu Dec 4 13:25:50 UTC 2008


On 04/12/2008 11:45, "James Procter" <jimp at compbio.dundee.ac.uk> wrote:
> Peter wrote:
>> On Fri, Nov 28, 2008 at 7:16 PM, Richard Holland wrote:

>>> I think the best approach is always to use what the file says, and
>>> trust that it's accurate. What needs to be agreed between projects is
>>> any additional annotations that get introduced outside the context of
>>> file parsing, and the names of the ontologies used for the file
>>> annotations so that all projects use the same ontologies and don't
>>> replicate them inside the BioSQL database. It would be nice to
>>> standardise these names and the additional custom terms across the
>>> projects, in much the same way as people tried already to standardise
>>> the way general objects get mapped to BioSQL.
>> 
>> This is what I am trying to get at here - documenting the existing "ad
>> hoc" ontology usage.  My impression is that it has not been
>> documented, and that the BioPerl behaviour is the de facto BioSQL
>> standard.
>> 
>> I'd like to pin down this standard, and extend it for situations like
>> the location_qualifier_value.term_id and perhaps location.term_id
>> where BioPerl seems to ignore the ontology issue.

Hi,

Just to add some of my experience with BioSQL and Biopython to the
discussion...

When I began to look at this issue a couple of years ago, it was clear that
the Biopython loader for GenBank files and BioSQL (and, to the best of my
knowledge, the BioPerl one too) put pretty much everything under an
'ontology' called 'Annotation Tags', with no definitions and only
rudimentary error-checking.

Now, BioSQL seems to have taken great care to ensure that, whatever one's
choice of ontology, it can be accommodated in the database schema.  There
is, as far as I can tell, no reason to favour one ontology over another on
the grounds of BioSQL compatibility and, if anything, the BioSQL schema
pretty much forced me to start considering ontologies in a serious manner.
My understanding is that BioSQL is ontology-neutral, and that the
appropriate choice of ontology is dependent on the data with which you want
to populate your database.

This suggests to me that the Bio* loaders are the things that need to be
dynamically ontology-aware, first to check if the appropriate ontology (as
selected by the user) for the data is present in the database, and then to
populate the database using those ontology terms, calling errors as
appropriate (e.g. for extraneous terms, mis-spellings, inappropriate data
types, etc.).
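
As a minimal sketch of what such an ontology-aware check might look like
(this is not the actual Biopython Loader.py code): the example below uses
Python's sqlite3 module with cut-down stand-ins for the BioSQL ontology and
term tables.  The names make_demo_db, require_ontology, require_term and
OntologyError, and the simplified column set, are assumptions made purely
for illustration.

```python
import sqlite3


class OntologyError(LookupError):
    """Raised when an ontology, or a term within it, is not in the database."""


def make_demo_db():
    """Create an in-memory database with simplified ontology/term tables.

    Assumption: these are reduced stand-ins, not the full BioSQL schema.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE ontology (
            ontology_id INTEGER PRIMARY KEY,
            name        TEXT NOT NULL UNIQUE,
            definition  TEXT
        );
        CREATE TABLE term (
            term_id     INTEGER PRIMARY KEY,
            name        TEXT NOT NULL,
            definition  TEXT,
            ontology_id INTEGER NOT NULL REFERENCES ontology(ontology_id),
            UNIQUE (name, ontology_id)
        );
        """
    )
    return conn


def require_ontology(conn, name):
    """Return the ID of the user-selected ontology, or raise if not loaded."""
    row = conn.execute(
        "SELECT ontology_id FROM ontology WHERE name = ?", (name,)
    ).fetchone()
    if row is None:
        raise OntologyError(f"ontology {name!r} is not loaded in the database")
    return row[0]


def require_term(conn, ontology_id, term_name):
    """Return the ID of an existing term; never create one as a side effect."""
    row = conn.execute(
        "SELECT term_id FROM term WHERE name = ? AND ontology_id = ?",
        (term_name, ontology_id),
    ).fetchone()
    if row is None:
        raise OntologyError(
            f"term {term_name!r} is not defined in ontology {ontology_id}"
        )
    return row[0]
```

A loader built along these lines would call require_ontology once, before
parsing, and require_term for each annotation key, so that a mis-spelt or
extraneous term stops the import rather than being stored.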

If your reason for using an ontology is, like mine, to ensure that
annotation terms have well-defined (or at least defined) meanings, and
perhaps incidentally to check the validity of a particular annotation file
within the domain of that ontology, then that can readily be done in
BioSQL.  I have managed this with both the Gene Ontology and the Sequence
Ontology, and with locally-defined ontologies.  BioSQL copes with these very
nicely, as does a modified Biopython Loader.py.

However, the current Biopython (and AFAIAA Bio*) behaviour with 'Annotation
Tags' doesn't correspond well to the above.  I think that this is a bad
thing in general, and that there is room for improvement, if we want it.

Apologies if I'm misinterpreting the tide of discussion, but I would be
disappointed to see a default behaviour of "bung everything under
'Annotation Tags', typos and all" become a 'standard' of any sort, rather
than a placeholder for future development of ontology-aware Bio* code that
queries and populates BioSQL appropriately.

I see the situation as pretty much analogous to the effective requirement
for NCBI taxon data in BioSQL, when using Biopython: you need to load in the
NCBI taxon data before your own data can be imported in a taxon-aware
manner. I would prefer to see a similar, but perhaps even more draconian
imposition of requiring an appropriate ontology (or ontologies) to be
present in the database before importing data, and a specification of which
ontology/ies is/are to be used when loading the data.  Then, where a term is
not yet known to an ontology in BioSQL, this might be an error in the source
data, or an oversight of the ontology.  Correcting either of these improves
the quality of the data and/or its description.  The catch-all 'Annotation
Tags' 'ontology', by contrast, seems to silently record a new term with a
different ID and permits no error correction.  For my own part, I would
rather this behaviour went away, eventually.
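
To illustrate the difference between the two behaviours, here is a small
sketch that works on plain Python collections rather than a real BioSQL
database.  Both function names are hypothetical, and the use of difflib to
suggest near-misses is my own assumption about how a loader might help
separate typos from genuinely missing terms.

```python
import difflib


def catch_all_term_id(known, name):
    """Catch-all style: an unknown name silently receives a fresh ID.

    `known` is a dict mapping term name -> term ID and is modified in place.
    """
    if name not in known:
        known[name] = max(known.values(), default=0) + 1
    return known[name]


def check_terms(ontology_terms, incoming):
    """Strict style: report unknown names instead of recording them.

    Returns a dict mapping each unknown name to a list of similar known
    names (possibly empty), so a typo can be told apart from a term that
    the ontology simply lacks.
    """
    candidates = sorted(ontology_terms)
    problems = {}
    for name in incoming:
        if name not in ontology_terms:
            problems[name] = difflib.get_close_matches(name, candidates, n=3)
    return problems
```

With known terms "gene" and "CDS", the catch-all function would give the
mis-spelling "gnee" an ID of its own, whereas check_terms would flag it and
suggest "gene", leaving the decision to correct the data or extend the
ontology to the user.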

Cheers,

L.


-- 
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk       w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C       tel:+44(0)1382 562731 x2405

