[BioSQL-l] Problem loading GenPept files into mysql biosql
Chris Fields
cjfields at uiuc.edu
Tue Jun 6 14:49:15 UTC 2006
I had similar issues with these files. Hilmar had this to say in a previous
post to bioperl-l:
http://article.gmane.org/gmane.comp.lang.perl.bio.general/10068
> I would generally advise against taking Uniprot/Swissprot entries from
> their GenPept reincarnation. The formats are incompatible in some
> aspects (e.g., Swissprot, like EMBL, has first-level db_xrefs, whereas
> GenBank format doesn't; instead it puts db_xrefs into the feature
> table).
If you really need these sequences, you should probably grab them from
UniProt/SwissProt directly and use the 'swiss' Bio::SeqIO format. I never found
a use for them since there was normally a properly parsed GenBank
counterpart (though I hate that they seemingly clog up the works).
We probably should have some warning added for these sequences in bioperl or
bioperl-db if this keeps popping up, though.
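
A warning of that kind could be prototyped outside BioPerl, as a pre-filter run
before load_seqdatabase.pl. The Python below is only a rough sketch, not
BioPerl or bioperl-db code. It assumes that GenPept records mirrored from
Swiss-Prot/UniProt can be recognised by the label on their DBSOURCE line (taken
here to be "swissprot" or "UniProtKB"), and the names dbsource_label,
split_records and swissprot_derived are made up for this example.

```python
import re

# Assumption: GenPept records that mirror Swiss-Prot/UniProt entries carry a
# DBSOURCE line starting with one of these labels. Adjust to match your data.
SWISSPROT_LABELS = ("swissprot", "uniprotkb")


def dbsource_label(record_text):
    """Return the lower-cased database label from the DBSOURCE line, or None."""
    match = re.search(r"^DBSOURCE[ \t]+([^:\s]+)", record_text, flags=re.MULTILINE)
    return match.group(1).lower() if match else None


def split_records(flat_file_text):
    """Split concatenated GenBank/GenPept flat-file text on the '//' terminator."""
    records, current = [], []
    for line in flat_file_text.splitlines():
        current.append(line)
        if line.strip() == "//":
            records.append("\n".join(current))
            current = []
    # Keep a trailing record that lacks its terminator, if it has any content.
    if any(line.strip() for line in current):
        records.append("\n".join(current))
    return records


def swissprot_derived(flat_file_text):
    """Yield (locus, label) for each record whose DBSOURCE label is flagged."""
    for record in split_records(flat_file_text):
        label = dbsource_label(record)
        if label in SWISSPROT_LABELS:
            locus = re.search(r"^LOCUS[ \t]+(\S+)", record, flags=re.MULTILINE)
            yield (locus.group(1) if locus else "?", label)
```

Calling swissprot_derived() on the text of each downloaded file and printing a
warning for every pair it yields would flag the records to fetch from UniProt
instead. Records with any other DBSOURCE label are left alone.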
Chris
> -----Original Message-----
> From: biosql-l-bounces at lists.open-bio.org [mailto:biosql-l-bounces at lists.open-bio.org] On Behalf Of Neil Saunders
> Sent: Tuesday, June 06, 2006 12:42 AM
> To: biosql-l at lists.open-bio.org
> Subject: [BioSQL-l] Problem loading GenPept files into mysql biosql
>
> hi,
>
> I've installed the MySQL BioSQL schema (Ubuntu Linux 5.10, BioPerl 1.5,
> MySQL 4.1.12). I have written a script that uses Bio::DB::GenPept to
> retrieve files by GI and then tries to load them using load_seqdatabase.pl:
>
> load_seqdatabase.pl --safe --dbname DBNAME --dbuser DBUSER --dbpass DBPASS
> --namespace genpept --format genbank <files>
>
> I'm getting a lot of errors of type:
>
> -------------------- WARNING ---------------------
> MSG: insert in Bio::DB::BioSQL::ReferenceAdaptor (driver) failed, values were
> ("","Direct Submission","Submitted (11-SEP-2004) National Center for Biotechnology Information, NIH, Bethesda, MD 20894, USA","CRC-EFE0D20CE0E07E7D","1","637","") FKs (<NULL>)
> Duplicate entry 'CRC-EFE0D20CE0E07E7D' for key 3
> ---------------------------------------------------
>
> This seems to be related to a similar problem using UniProt discussed on
> this list:
>
> http://lists.open-bio.org/pipermail/biosql-l/2006-May/000977.html
>
> Am I right in thinking that a CRC is generated from the JOURNAL line of a
> GenPept file and that non-unique CRCs are causing this problem? My GenPept
> files are actually RefSeq entries from complete microbial genomes. An example
> would be NP_378145 (GI 15922476). The REFERENCE for such records is often
> "Direct Submission" rather than a journal and obviously in these cases, the
> set of all proteins from a genome has the same REFERENCE, so unique CRCs
> don't seem like a good idea.
>
> I'd be grateful if anyone could confirm that these records are a problem
> and suggest any workarounds,
>
> thanks,
> Neil
> --
> School of Molecular and Microbial Sciences
> University of Queensland
> Brisbane 4072 Australia
>
> http://psychro.bioinformatics.unsw.edu.au/neil