[BioSQL-l] Problem loading GenPept files into mysql biosql

Chris Fields cjfields at uiuc.edu
Tue Jun 6 14:49:15 UTC 2006


I had similar issues with these files.  Hilmar had this to say in a previous
post to bioperl-l:

http://article.gmane.org/gmane.comp.lang.perl.bio.general/10068

> I would generally advise against taking Uniprot/Swissprot entries from
> their GenPept reincarnation. The formats are incompatible in some
> aspects (e.g., Swissprot, like EMBL, has first-level db_xrefs, whereas
> GenBank format doesn't; instead it puts db_xrefs into the feature
> table).

If you really need these sequences, you should probably grab them from
UniProt/SwissProt directly and parse them with the 'swiss' Bio::SeqIO format.
I never found a use for the GenPept versions, since there was normally a
properly parsed GenBank counterpart (though I hate that they seem to clog up
the works).  We should probably add a warning for these sequences to bioperl
or bioperl-db if this keeps popping up, though.
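
Until such a warning exists, a rough pre-check can be done outside of bioperl.
The following is only a sketch: list_swissprot_derived is a made-up helper
name, and the DBSOURCE patterns it greps for are an assumption about how
GenPept flat files label Swiss-Prot-derived entries, so check them against
your own files first.

```shell
# Hypothetical helper (not part of bioperl or bioperl-db): print the names of
# GenPept flat files whose DBSOURCE line points at Swiss-Prot/UniProt.
# ASSUMPTION: such records carry a line like "DBSOURCE    swissprot: ..." or
# "DBSOURCE    UniProtKB: ..."; adjust the pattern if your files differ.
list_swissprot_derived() {
    # Nothing to scan: avoid grep waiting on stdin.
    [ "$#" -gt 0 ] || return 0
    # -l prints only the names of matching files; -i ignores case.
    grep -l -i -E '^DBSOURCE[[:space:]]+(swissprot|uniprotkb)' "$@" 2>/dev/null
    # grep exits non-zero when nothing matches; that is not an error here.
    return 0
}
```

Calling list_swissprot_derived *.gp lists the files to set aside.  Those
entries could then be fetched from UniProt and loaded with the same
load_seqdatabase.pl options as in the original post, but with --format swiss
instead of --format genbank.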

Chris

> -----Original Message-----
> From: biosql-l-bounces at lists.open-bio.org [mailto:biosql-l-
> bounces at lists.open-bio.org] On Behalf Of Neil Saunders
> Sent: Tuesday, June 06, 2006 12:42 AM
> To: biosql-l at lists.open-bio.org
> Subject: [BioSQL-l] Problem loading GenPept files into mysql biosql
> 
> hi,
> 
> I've installed the MySQL BioSQL schema (Ubuntu Linux 5.10, BioPerl 1.5, MySQL
> 4.1.12).  I have written a script that uses Bio::DB::GenPept to retrieve files
> by GI and then tries to load them using load_seqdatabase.pl:
> 
> load_seqdatabase.pl --safe --dbname DBNAME --dbuser DBUSER --dbpass DBPASS
> --namespace genpept --format genbank <files>
> 
> I'm getting a lot of errors of type:
> 
> -------------------- WARNING ---------------------
> MSG: insert in Bio::DB::BioSQL::ReferenceAdaptor (driver) failed, values were
> ("","Direct Submission","Submitted (11-SEP-2004) National Center for
> Biotechnology Information, NIH, Bethesda, MD 20894,
> USA","CRC-EFE0D20CE0E07E7D","1","637","") FKs (<NULL>)
> Duplicate entry 'CRC-EFE0D20CE0E07E7D' for key 3
> ---------------------------------------------------
> 
> This seems to be related to a similar problem using UniProt discussed on
> this list:
> 
> http://lists.open-bio.org/pipermail/biosql-l/2006-May/000977.html
> 
> Am I right in thinking that a CRC is generated from the JOURNAL line of a
> GenPept file and that non-unique CRCs are causing this problem?  My GenPept
> files are actually RefSeq entries from complete microbial genomes.  An example
> would be NP_378145 (GI 15922476).  The REFERENCE for such records is often
> "Direct Submission" rather than a journal and obviously in these cases, the set
> of all proteins from a genome has the same REFERENCE, so unique CRCs don't seem
> like a good idea.
> 
> I'd be grateful if anyone could confirm that these records are a problem and
> suggest any workarounds,
> 
> thanks,
> Neil
> --
>   School of Molecular and Microbial Sciences
>   University of Queensland
>   Brisbane 4072 Australia
> 
> http://psychro.bioinformatics.unsw.edu.au/neil



