[Bioperl-l] GO categories and load_ontology.pl
Hilmar Lapp
hlapp at gnf.org
Wed Mar 17 13:17:30 EST 2004
Annie, I still owe you an answer for your earlier email. I haven't
managed to get to that yet. See below for my response to this one.
On Wednesday, March 17, 2004, at 08:50 AM, Law, Annie wrote:
> It seems that most of the entries in the term table are of Ontology Id = 1
> (Gene ontology), and only around 200 entries are molecular function,
> biological process, and cellular component put together, when there are
> about 16000 entries in the term table.
> This is only true if I load locuslink into the database.
This is because LocusLink lags behind GO: the GO release it uses for
annotating sequences is older than the current one, so LocusLink still
uses some terms that have since been retired or made obsolete. Those
terms are either still listed in GO's .defs file as obsolete, in which
case they won't be in your database if you chose to ignore obsoleted
entries (not a bad choice per se), or they are no longer part of GO at
all.
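LocusLink doesn't give the ontology of GO terms (which would be 'Gene
Ontology'); rather, it gives the category (molecular function,
biological process, or cellular component). Because a term must have an
ontology associated with it, the SeqIO LL parser stores as the ontology
what NCBI really meant to be the category. Terms that aren't already in
your database under 'Gene Ontology' therefore end up under those three
names.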
You have the following choices for how to proceed:
- Ignore the 200 entries which aren't in Gene Ontology. You're not
  going to miss a significant amount of your annotation, and it's
  annotation with obsoleted terms anyway.
- Load GO including obsoleted terms, and see how many non-Gene
  Ontology terms that leaves you with. If it's a lot fewer than 200,
  you may just want to ignore the rest.
- Build a SeqProcessor module (see Bio::Factory::SeqProcessorI and
  Bio::Seq::BaseSeqProcessor) which takes the seq objects as the LL
  parser returns them, retrieves all GO term annotations, and
  replaces the ontology for those with 'Gene Ontology'. Then you pass
  your SeqProcessor to load_seqdatabase.pl using the --pipeline
  command-line option (see the script's POD).
The last option may sound like a lot of work, but it really isn't if
you can program in Perl. Note, however, that you still wouldn't have
any relationships for those terms - they have simply been retired.
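To give an idea of the size of the third option, here is a minimal
sketch of such a SeqProcessor. It is untested against your data and
rests on several assumptions: the module name GOOntologyFixer is made
up, GO annotations are assumed to be Bio::Annotation::OntologyTerm
objects whose identifier looks like GO:0003677, and each entry is
assumed to carry an annotation collection. Check what the LL parser
actually attaches to the seq objects before relying on it.

```perl
package GOOntologyFixer;    # hypothetical module name, pick your own

use strict;
use warnings;
use base qw(Bio::Seq::BaseSeqProcessor);
use Bio::Ontology::Ontology;

# One shared ontology object carrying the name the terms should end up under.
my $GO = Bio::Ontology::Ontology->new(-name => 'Gene Ontology');

# Called once per sequence; returns the list of resulting sequence objects.
sub process_seq {
    my ($self, $seq) = @_;

    foreach my $ann ($seq->annotation->get_all_Annotations()) {
        # Skip everything that is not an ontology term annotation.
        next unless $ann->isa('Bio::Annotation::OntologyTerm');

        # Assumption: GO terms are recognizable by their identifier.
        my $id = $ann->identifier();
        next unless defined $id && $id =~ /^GO:\d+$/;

        # Replace the category (e.g. 'molecular function') with GO itself.
        $ann->ontology($GO);
    }
    return ($seq);
}

1;
```

Saved as GOOntologyFixer.pm somewhere in @INC, the module name is what
you would give to --pipeline. Terms outside GO are left untouched.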
Depending on what your project is, just ignoring those 200 may be the
most reasonable way to go.
-hilmar
--
-------------------------------------------------------------
Hilmar Lapp email: lapp at gnf.org
GNF, San Diego, Ca. 92121 phone: +1-858-812-1757
-------------------------------------------------------------