[BioSQL-l] Problem loading GO.

Hilmar Lapp hlapp at gmx.net
Tue Apr 17 15:09:45 UTC 2007


On Apr 17, 2007, at 9:35 AM, Leighton Pritchard wrote:

> Hi Hilmar,
>
> Thanks for the very quick response.  Apologies for the long reply, but I
> thought it might be useful if anyone else happens across the same
> problems that I did.

Thanks for reporting all these.

> [...]
> -------------------- WARNING ---------------------
> MSG: insert in Bio::DB::BioSQL::DBLinkAdaptor (driver) failed, values
> were ("","","0","") FKs ()
> Column 'dbname' cannot be null
> ---------------------------------------------------
> Could not store term GO:0047554, name '2-pyrone-4,6-dicarboxylate
> lactonase activity':
> [...]
> I tracked this down to an apparently poor formatting of the GO.defs file
> (note that the first and third definition_lines appear to be two halves
> of the same entry):
>
> term: 2-pyrone-4,6-dicarboxylate lactonase activity
> goid: GO:0047554
> definition: Catalysis of the reaction: 2-pyrone-4,6-dicarboxylate + H2O
> = 4-carboxy-2-hydroxyhexa-2,4-dienedioate.
> definition_reference: :6-DICARBOXYLATE-LACTONASE-RXN

I wonder whether this is the line that throws the parser off. It  
looks like the database part of the reference is missing - bad.

> definition_reference: EC:3.1.1.57
> definition_reference: MetaCyc:2-PYRONE-4
>
> I found 43 similar errors for other GOIDs, and it appears to result from
> the occurrence of the string "\," in a dbxref - mostly MetaCyc entries,
> but also some UM-BBD_pathwayID entries.

I'm not sure. The string "\," might indeed trip up the parser, but I
would have to investigate to confirm. Could it be a coincidence with
definition_references that lack the database part before the colon?
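
To tell the two suspects apart, one could scan the file for both
conditions separately. The following is only a minimal sketch in Python,
not anything that BioPerl or BioSQL provides: the function name
check_definition_references is made up for illustration, and it assumes
that every reference sits on its own "definition_reference:" line in the
form database:accession, as in the excerpt quoted above.

```python
REF_PREFIX = "definition_reference:"

def check_definition_references(lines):
    """Return (line_number, problem, reference) tuples for suspicious lines.

    Hypothetical helper for illustration; assumes one reference per line
    in the form 'definition_reference: DB:ACCESSION'.
    """
    problems = []
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line.startswith(REF_PREFIX):
            continue
        ref = line[len(REF_PREFIX):].strip()
        db, sep, accession = ref.partition(":")
        if not sep or not db:
            # Nothing in front of the colon (or no colon at all).
            problems.append((number, "missing database", ref))
        if sep and not accession:
            problems.append((number, "missing accession", ref))
        if "\\," in ref:
            # Backslash-escaped comma inside the reference.
            problems.append((number, "escaped comma", ref))
    return problems

sample = [
    "term: 2-pyrone-4,6-dicarboxylate lactonase activity",
    "goid: GO:0047554",
    "definition_reference: :6-DICARBOXYLATE-LACTONASE-RXN",
    "definition_reference: EC:3.1.1.57",
    "definition_reference: MetaCyc:2-PYRONE-4",
]

for number, problem, ref in check_definition_references(sample):
    print(f"line {number}: {problem}: {ref}")
```

Run over the whole file, this would show whether the entries with a
missing database part and the entries containing "\," are the same set or
merely overlap.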

>
> These errors appear to have followed through into the generation of the
> OBO format files in each case, e.g.:
>
> def: "Catalysis of the reaction: 2-pyrone-4,6-dicarboxylate + H2O =
> 4-carboxy-2-hydroxyhexa-2,4-dienedioate." [:6-DICARBOXYLATE-LACTONASE-RXN,
> EC:3.1.1.57, MetaCyc:2-PYRONE-4]

Again, the first db_xref lacks the database in front of the colon. I  
can also see why "\," will trip up the parser in this format.
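
To illustrate why the escaped comma matters in a comma-separated list,
here is a small Python sketch (not the BioPerl parser, and the function
name split_dbxrefs is hypothetical). It assumes that a backslash escapes
the character that follows it, and that the reference originally read
MetaCyc:2-PYRONE-4\,6-DICARBOXYLATE-LACTONASE-RXN, which is a guess based
on the two halves seen above rather than something I have checked.

```python
def split_dbxrefs(text):
    """Split a comma-separated dbxref list, honouring backslash escapes.

    Hypothetical helper; assumes a backslash escapes the next character.
    """
    items, current = [], []
    chars = iter(text)
    for ch in chars:
        if ch == "\\":
            # Keep the escaped character literally and drop the backslash.
            current.append(next(chars, ""))
        elif ch == ",":
            # An unescaped comma ends the current dbxref.
            items.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    items.append("".join(current).strip())
    return [item for item in items if item]

# Assumed original form of the list, before anything split it.
raw = "MetaCyc:2-PYRONE-4\\,6-DICARBOXYLATE-LACTONASE-RXN, EC:3.1.1.57"

print(raw.split(","))      # a split that ignores escapes yields three pieces
print(split_dbxrefs(raw))  # the escape-aware split yields two dbxrefs
```

The escape-aware version also removes the backslash, so the accession is
stored with a plain comma.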

>
> and so is something for the GO guys to fix, I guess.

The lack of a database for certain xrefs surely is. If the escaped  
comma does throw off the BioPerl parser then that part is for BioPerl  
to fix. It does seem to extract the parts correctly, if the error  
message is any indication, though you may argue that it should remove  
the escaping backslashes (and I'd certainly agree with that).

>
>
> Another error is thrown after fixing the above, though (with the same
> command as before):
>
> Loading ontology Gene Ontology:
>         ... terms
>
> -------------------- WARNING ---------------------
> MSG: insert in Bio::DB::BioSQL::TermAdaptor (driver) failed, values were
> ("GO:0006905","vesicle transport","OBSOLETE (was not defined before
> being made obsolete).","X","") FKs (1)
> Duplicate entry 'vesicle transport-1-X' for key 3
> ---------------------------------------------------
> Could not store term GO:0006905, name 'vesicle transport':
> [...]
> There are duplicate terms, identical in the term table except for GOID:
> GO:0006905 and GO:0005480.  They are both "vesicle transport", and
> obsoleted:

That violates the uniqueness constraint, and this sounds more like a  
bug in the GO file. I'm also not sure what motivated them to create  
the same term multiple times only to obsolete it immediately.
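
Such clashes can be found before loading. The sketch below is in Python
and assumes, going by the 'vesicle transport-1-X' key in the error
message, that within one ontology the name together with the obsolete
flag has to be unique. The function name and the tuple layout are made up
for illustration, and the terms still have to be parsed out of the file
first.

```python
from collections import defaultdict

def find_duplicate_terms(terms):
    """Group term identifiers by (name, is_obsolete); return clashing groups.

    Hypothetical helper; 'terms' is an iterable of
    (identifier, name, is_obsolete) tuples from a single ontology.
    """
    groups = defaultdict(list)
    for identifier, name, is_obsolete in terms:
        groups[(name, is_obsolete)].append(identifier)
    # Only keys shared by more than one identifier would violate the constraint.
    return {key: ids for key, ids in groups.items() if len(ids) > 1}

terms = [
    ("GO:0006905", "vesicle transport", True),
    ("GO:0005480", "vesicle transport", True),
    ("GO:0047554", "2-pyrone-4,6-dicarboxylate lactonase activity", False),
]

for (name, is_obsolete), ids in find_duplicate_terms(terms).items():
    print(f"{name!r} (obsolete={is_obsolete}): {', '.join(ids)}")
```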

> [...]
> -------------------- WARNING ---------------------
> MSG: insert in Bio::DB::BioSQL::DBLinkAdaptor (driver) failed, values
> were ("PMID","","0","") FKs ()
> Column 'accession' cannot be null
> ---------------------------------------------------
> Could not store term GO:0032933, name 'SREBP-mediated signaling
> pathway':
> [...]
> with the offending entry being
>
> term: SREBP-mediated signaling pathway
> goid: GO:0032933
> definition: A series of molecular signals from the endoplasmic reticulum
> to the nucleus generated as a consequence of altered levels of one or
> more lipids, and resulting in the activation of transcription by SREBP.
> definition_reference: GOC:mah
> definition_reference: PMID:0
>
> I commented out the definition_reference for PMID:0, which seemed to fix
> matters.

Right, it seems to be a bogus reference.
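
If more of these turn up, they could be filtered out in one pass instead
of being edited by hand. A minimal Python sketch, assuming one reference
per "definition_reference:" line; the function name and the list of
placeholder values are my own choice for illustration, and PMID:0 is the
only one reported here.

```python
def drop_placeholder_references(lines, placeholders=("PMID:0",)):
    """Return the lines without definition_reference entries that are placeholders.

    Hypothetical helper; 'placeholders' lists references treated as bogus.
    """
    prefix = "definition_reference:"
    kept = []
    for line in lines:
        stripped = line.strip()
        is_reference = stripped.startswith(prefix)
        if is_reference and stripped[len(prefix):].strip() in placeholders:
            continue  # skip the bogus reference, keep everything else
        kept.append(line)
    return kept

entry = [
    "term: SREBP-mediated signaling pathway",
    "goid: GO:0032933",
    "definition_reference: GOC:mah",
    "definition_reference: PMID:0",
]

for line in drop_placeholder_references(entry):
    print(line)
```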

>
> The process.ontology and component.ontology files then went into the
> database without a hitch.  Thanks again for your help,

Fantastic you got it all loaded!

Note that you also have the --computetc switch which will compute the  
transitive closure for you automatically.
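
For anyone unfamiliar with the term: the transitive closure adds, for
every term, the links to all of its indirect ancestors as well as the
direct ones. The Python sketch below only illustrates the idea on
(child, parent) pairs with made-up term names A, B and C. It is not a
description of what the switch does internally.

```python
def transitive_closure(edges):
    """Return all (descendant, ancestor) pairs reachable through the edges.

    Illustrative only; 'edges' is an iterable of (child, parent) pairs.
    """
    parents = {}
    for child, parent in edges:
        parents.setdefault(child, set()).add(parent)

    closure = set()
    for start in parents:
        stack = list(parents[start])
        seen = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue  # already visited; also guards against cycles
            seen.add(node)
            closure.add((start, node))
            stack.extend(parents.get(node, ()))
    return closure

# C is a child of B, and B is a child of A.
edges = [("C", "B"), ("B", "A")]

for pair in sorted(transitive_closure(edges)):
    print(pair)
```

The closure of the two direct links contains the additional pair
(C, A), which is the kind of indirect relationship that otherwise has to
be followed step by step at query time.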

	-hilmar
-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================







