[BioSQL-l] Problem loading GenPept files into mysql biosql

Tue Jun 6 05:41:34 UTC 2006

Hi,

I've installed the MySQL BioSQL schema (Ubuntu Linux 5.10, BioPerl 1.5, MySQL 
4.1.12).  I have written a script that uses Bio::DB::GenPept to retrieve files 
by GI and then tries to load them using load_seqdatabase.pl:

load_seqdatabase.pl --safe --dbname DBNAME --dbuser DBUSER --dbpass DBPASS \
    --namespace genpept --format genbank <files>

I'm getting a lot of warnings of this type:

-------------------- WARNING ---------------------
MSG: insert in Bio::DB::BioSQL::ReferenceAdaptor (driver) failed, values were 
("","Direct Submission","Submitted (11-SEP-2004) National Center for 
Biotechnology Information, NIH, Bethesda, MD 20894, 
USA","CRC-EFE0D20CE0E07E7D","1","637","") FKs (<NULL>)
Duplicate entry 'CRC-EFE0D20CE0E07E7D' for key 3
---------------------------------------------------

This seems to be related to a similar problem using UniProt discussed on this list:

http://lists.open-bio.org/pipermail/biosql-l/2006-May/000977.html

Am I right in thinking that a CRC is generated from the JOURNAL line of a
GenPept file and that non-unique CRCs are causing this problem?  My GenPept
files are actually RefSeq entries from complete microbial genomes.  An example
would be NP_378145 (GI 15922476).  The REFERENCE for such records is often
"Direct Submission" rather than a journal article.  In these cases every
protein from a genome carries the same REFERENCE, and so the same CRC, so
requiring the CRC to be unique doesn't seem like a good idea.
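
A minimal sketch of the suspected failure mode, in Python rather than the
actual BioPerl code.  Everything in it is an assumption made for illustration:
reference_key() is a hypothetical stand-in that uses zlib.crc32 over the
authors, title and location fields (the real key in the warning above looks
like a 64-bit CRC and may be computed differently), and the table is a
stripped-down SQLite imitation of a reference table with a unique key on the
CRC column, not the BioSQL schema itself.

```python
import sqlite3
import zlib


def reference_key(authors, title, location):
    """Hypothetical stand-in for the reference checksum.

    Assumption: the key is a checksum over the concatenated reference
    fields. zlib.crc32 is used only for illustration.
    """
    text = f"{authors}{title}{location}"
    return "CRC-%08X" % zlib.crc32(text.encode("utf-8"))


def load_references(refs):
    """Insert references into a table with a UNIQUE crc column.

    Returns (inserted, duplicates), where duplicates counts the inserts
    rejected by the unique constraint.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE reference ("
        " reference_id INTEGER PRIMARY KEY,"
        " title TEXT,"
        " location TEXT,"
        " crc TEXT UNIQUE)"
    )
    inserted = 0
    duplicates = 0
    for authors, title, location in refs:
        try:
            conn.execute(
                "INSERT INTO reference (title, location, crc) VALUES (?, ?, ?)",
                (title, location, reference_key(authors, title, location)),
            )
            inserted += 1
        except sqlite3.IntegrityError:
            # Same checksum as an existing row: the unique key rejects it.
            duplicates += 1
    conn.close()
    return inserted, duplicates


# Field values copied from the warning message above.
DIRECT_SUBMISSION = (
    "",
    "Direct Submission",
    "Submitted (11-SEP-2004) National Center for Biotechnology "
    "Information, NIH, Bethesda, MD 20894, USA",
)

if __name__ == "__main__":
    # Two proteins from the same genome share one Direct Submission reference.
    print(load_references([DIRECT_SUBMISSION, DIRECT_SUBMISSION]))  # (1, 1)
```

If this is what is going on, the first protein loaded from a genome would
insert the reference and every later protein with the same Direct Submission
block would trigger the duplicate-entry warning.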

I'd be grateful if anyone could confirm that these records are a problem and
suggest any workarounds.

thanks,
Neil
-- 
  School of Molecular and Microbial Sciences
  University of Queensland
  Brisbane 4072 Australia

http://psychro.bioinformatics.unsw.edu.au/neil