[Bioperl-l] GO categories and load_ontology.pl

Wed Mar 17 13:17:30 EST 2004

Annie, I still owe you an answer for your earlier email. I haven't 
managed to get to that yet. See below for my response to this one.

On Wednesday, March 17, 2004, at 08:50  AM, Law, Annie wrote:

> It seems that most of the entries in the term table are of Ontology Id
> = 1 (Gene ontology), and only around 200 entries are molecular
> function, biological process, and cellular component put together,
> when there are about 16000 entries in the term table.
> This is only true if I load locuslink into the database.

This is because LocusLink lags behind the latest version of GO in terms 
of the release that they use for annotating sequences. I.e., LocusLink 
uses some terms which have meanwhile been retired or obsoleted from GO. 
Either those terms are still in GO's .defs file, in which case they
won't be in your database if you chose to ignore obsoleted entries when
loading GO (which is not a bad choice at all per se), or they aren't
part of GO anymore at all.

LocusLink doesn't give the ontology of GO terms (which would be 'Gene 
Ontology'); rather it gives the category. Because a term must have an 
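ontology associated, the SeqIO LL parser interprets as the ontology
what NCBI really meant to be the category. Terms that can't be matched
to a term already loaded from GO therefore end up under ontologies
named after the three categories instead of under 'Gene Ontology'.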

You have the following options for how to proceed.

	- Ignore the 200 entries which aren't in Gene Ontology. You're not 
going to miss a significant amount of your annotation, and it's 
annotation with obsoleted terms anyway.

	- Load GO including obsoleted terms, and see how many non-Gene
Ontology terms that leaves you with. If it's a lot fewer than 200, you
may just want to ignore the rest.

	- Build a SeqProcessor module (see Bio::Factory::SeqProcessorI and 
Bio::Seq::BaseSeqProcessor) which takes the seq objects as the LL 
parser returns them, goes in and retrieves all GO term annotations, and 
replaces the ontology for those with 'Gene Ontology'. Then you pass
your SeqProcessor to load_seqdatabase.pl using the --pipeline 
command-line option (see the script's POD).

The last option may sound like a lot of work, but it really isn't if
you can program Perl. Note, however, that you then still wouldn't have
any relationships for those terms - they simply have been retired.
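
To give an idea of the size of the last option, here is a rough,
untested sketch of what such a SeqProcessor might look like. The
package name GOOntologyFixer is made up for illustration. The sketch
also assumes that the LL parser attaches GO terms as
Bio::Annotation::OntologyTerm objects whose identifier starts with
"GO:", and that GO was loaded under the ontology name 'Gene Ontology';
please check both against your own data before relying on it.

```perl
package GOOntologyFixer;    # hypothetical module name

use strict;
use warnings;

use Bio::Ontology::Ontology;
use base qw(Bio::Seq::BaseSeqProcessor);

# Assumption: this is the name under which GO was loaded into the database.
my $GO_NAME = 'Gene Ontology';

# process_seq() is the one method a processor derived from
# Bio::Seq::BaseSeqProcessor needs to override. It receives a single
# sequence object and returns a list of zero or more sequence objects.
sub process_seq {
    my ($self, $seq) = @_;

    # One ontology object, shared by all terms of this sequence.
    my $go = Bio::Ontology::Ontology->new(-name => $GO_NAME);

    foreach my $ann ($seq->annotation->get_Annotations()) {
        # Only ontology term annotations are of interest here.
        next unless $ann->isa('Bio::Annotation::OntologyTerm');

        # Assumption: GO terms can be told apart by their accession prefix.
        my $id = $ann->identifier;
        next unless defined $id && $id =~ /^GO:/;

        # Replace the category (e.g. 'molecular function') that the
        # parser stored as the ontology with 'Gene Ontology'.
        $ann->ontology($go);
    }

    # Hand the (modified) sequence on to the next stage of the pipeline.
    return ($seq);
}

1;
```

With the module somewhere in your Perl library path, its package name
is what you would give to --pipeline; the script's POD has the exact
syntax. If the annotations also carry a tag name derived from the
category, that may need adjusting as well, which I haven't checked.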

Depending on what your project is, just ignoring those 200 may be the 
most reasonable way to go.

	-hilmar
-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------