[BioSQL-l] error loading uniprot release 49.6 into mysql

Hilmar Lapp hlapp at gmx.net
Mon May 15 16:59:06 UTC 2006


You found the right instance. Unfortunately, the way the Bioperl
swissprot parser works, the group (RG) isn't promoted to author when
an entry has no author of its own (in fact you may debate whether that
would even be the best way of doing things), so the reference isn't
found by unique key on its second occurrence and a duplicate insert is
attempted.

If you can live without this entry, or any other entry that causes a  
hiccup, just supply the flag --safe and it will gracefully move on to  
the next entry.
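
As a rough illustration of what such a skip-on-error mode amounts to (this is not the loader's actual code; load_entries, store and safe are hypothetical names made up for the sketch), each entry is stored inside its own try block, and a failure is recorded instead of aborting the run:

```python
def load_entries(entries, store, safe=False):
    """Store each entry; with safe=True, record failures and keep going."""
    loaded, skipped = [], []
    for entry in entries:
        try:
            store(entry)
            loaded.append(entry)
        except Exception as exc:
            if not safe:
                # Default behaviour: the first failing entry stops the load.
                raise
            # Safe mode: remember what went wrong and move on.
            skipped.append((entry, str(exc)))
    return loaded, skipped


def picky_store(entry):
    # Hypothetical store callback that rejects one particular entry.
    if entry == "1433G_PONPY":
        raise ValueError("Duplicate entry for " + entry)


loaded, skipped = load_entries(
    ["1433T_PONPY", "1433G_PONPY", "1433Z_PONPY"], picky_store, safe=True
)
```

The trade-off is that skipped entries are simply absent from the database afterwards, so it is worth keeping the list of what was skipped.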

Fixing the issue would require either fixing the Bioperl swissprot
parser (or Bio::Annotation::Reference) to stick the RG group into the
author slot if there is no author, or extending Bioperl's
Bio::Annotation::Reference to also feature a group and Biosql to use
it in place of a missing author.

Actually there is $reference->rg. Maybe Bioperl-db (and hence Biosql)  
should just use that in place of a missing author?
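
A toy model of that fallback, in Python rather than Perl and not taken from Bioperl-db: it assumes the lookup by checksum is only attempted when an author is present, which is what the behaviour described above suggests. Reference, ReferenceStore, crc_key and use_group_as_author are hypothetical names, and CRC-32 stands in for the real checksum.

```python
import zlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class Reference:
    location: str
    authors: Optional[str] = None
    title: Optional[str] = None
    group: Optional[str] = None  # the RG line, analogous to $reference->rg


def crc_key(ref: Reference, authors: Optional[str]) -> str:
    # Stand-in checksum over authors, title and location (CRC-32 here).
    text = (authors or "<undef>") + (ref.title or "<undef>") + ref.location
    return "CRC-%08X" % zlib.crc32(text.encode("utf-8"))


class DuplicateKey(Exception):
    pass


class ReferenceStore:
    """Toy table with a unique constraint on the checksum column."""

    def __init__(self, use_group_as_author: bool = False):
        self.rows = {}  # checksum -> primary key
        self.use_group_as_author = use_group_as_author

    def _authors(self, ref):
        if ref.authors:
            return ref.authors
        return ref.group if self.use_group_as_author else None

    def find(self, ref):
        authors = self._authors(ref)
        if not authors:
            # Assumption: without an author there is no lookup by checksum.
            return None
        return self.rows.get(crc_key(ref, authors))

    def insert(self, ref):
        key = crc_key(ref, self._authors(ref))
        if key in self.rows:
            raise DuplicateKey(key)
        self.rows[key] = len(self.rows) + 1
        return self.rows[key]

    def store(self, ref):
        found = self.find(ref)
        return found if found is not None else self.insert(ref)


loc = "Submitted (NOV-2004) to the EMBL/GenBank/DDBJ databases."
first = Reference(loc, group="The German cDNA consortium")
second = Reference(loc, group="The German cDNA consortium")
```

With ReferenceStore() the second store() call raises DuplicateKey, mirroring the reported error; with ReferenceStore(use_group_as_author=True) the second call finds the existing row and returns the same primary key.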

The downside is that upon round-tripping an entry, the RG annotation  
line will become an RA annotation line. How bad would that be?
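
To make the round-trip effect concrete, here is a small Python sketch under the assumption that the database side keeps only an authors value; parse, store_with_fallback and write are hypothetical helpers for illustration, not Bioperl or Bioperl-db functions.

```python
def parse(lines):
    """Pick the RA (authors) and RG (group) values out of reference lines."""
    rec = {"authors": None, "group": None}
    for line in lines:
        tag, value = line[:2], line[5:].rstrip(";")
        if tag == "RA":
            rec["authors"] = value
        elif tag == "RG":
            rec["group"] = value
    return rec


def store_with_fallback(rec):
    # Assumption: only an authors value is persisted, so a group stored in
    # place of a missing author is no longer distinguishable from an author.
    return {"authors": rec["authors"] or rec["group"]}


def write(row):
    """Write the stored reference back out; only RA can be produced."""
    return ["RA   %s;" % row["authors"]] if row["authors"] else []


original = ["RG   The German cDNA consortium;"]
round_tripped = write(store_with_fallback(parse(original)))
```

Here round_tripped holds an RA line carrying the consortium name, where the input had an RG line.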

Any thoughts from anyone?

	-hilmar

On May 15, 2006, at 8:34 AM, s.rayner at att.net wrote:

> I found where the script is hiccuping....
>
> The Uniprot release contains lines with identical annotation for  
> the RL keyword for two different sequences.
>
> ___________________
>
> First occurrence...
> ___________________
>
> ID   1433T_PONPY    STANDARD;      PRT;   245 AA.
> AC   Q5RFJ2; Q5RDK2;
> DT   05-JUL-2005, integrated into UniProtKB/Swiss-Prot.
> DT   05-JUL-2005, sequence version 2.
> DT   18-APR-2006, entry version 13.
> DE   14-3-3 protein theta.
> GN   Name=YWHAQ;
> OS   Pongo pygmaeus (Orangutan).
> OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
> OC   Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
> OC   Catarrhini; Hominidae; Pongo.
> OX   NCBI_TaxID=9600;
> RN   [1]
> RP   NUCLEOTIDE SEQUENCE [LARGE SCALE MRNA].
> RC   TISSUE=Brain cortex, and Kidney;
> RG   The German cDNA consortium;
> RL   Submitted (NOV-2004) to the EMBL/GenBank/DDBJ databases.   <======  Not Unique
>
>
> ___________________
>
> Second occurrence...
> ___________________
>
>
> ID   1433G_PONPY    STANDARD;      PRT;   246 AA.
> AC   Q5RC20;
> DT   05-JUL-2005, integrated into UniProtKB/Swiss-Prot.
> DT   05-JUL-2005, sequence version 2.
> DT   18-APR-2006, entry version 13.
> DE   14-3-3 protein gamma.
> GN   Name=YWHAG;
> OS   Pongo pygmaeus (Orangutan).
> OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
> OC   Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
> OC   Catarrhini; Hominidae; Pongo.
> OX   NCBI_TaxID=9600;
> RN   [1]
> RP   NUCLEOTIDE SEQUENCE [LARGE SCALE MRNA].
> RC   TISSUE=Heart;
> RG   The German cDNA consortium;
> RL   Submitted (NOV-2004) to the EMBL/GenBank/DDBJ databases.   <======  Not Unique
>
>
>
> in these two cases the generated CRC key is identical and so MySQL  
> throws a wobbly.
>
> if i look at the MySQL entry in the REFERENCE table for the first  
> sequence
> +-----+------+----------------------------------------------------------+------+------+----------------------+
> | 139 | NULL | Submitted (NOV-2004) to the EMBL/GenBank/DDBJ databases. | NULL | NULL | CRC-E7973FEA4B5611DC |
> +-----+------+----------------------------------------------------------+------+------+----------------------+
>
> and the error when the script choked was
>
>  MSG: insert in Bio::DB::BioSQL::ReferenceAdaptor (driver) failed,  
> values were
>  ("","","Submitted (NOV-2004) to the EMBL/GenBank/DDBJ
>  databases.","CRC-E7973FEA4B5611DC","","","") FKs (<NULL)
>  Duplicate entry 'CRC-E7973FEA4B5611DC' for key 3
>
> hence the problem.
>
> I'm guessing I'm not the first person to encounter this, but don't
> see any hints for an easy way around this.
>
> any suggestions....?
>
> ta
>

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================







