GenBank indexing Trouble

Peter Rice peter.rice at uk.lionbioscience.com
Wed Sep 11 10:29:41 UTC 2002


Hironori Kawai wrote:
> Thanks for uploading the new version so quickly.
> The problem I reported does not occur in the new version.

Thanks.

> But I would like to discuss another issue.
> In my previous report, I mentioned the duplicate
> ID 'AY071141'. The duplicate entries are shown below.
> --------------------------------------------------
> LOCUS       AY071141                2622 bp   
> DEFINITION  Drosophila melanogaster RE17910 full length cDNA.
> ACCESSION   AY071141
> 
> LOCUS       AY071141                2958 bp  
> DEFINITION  Drosophila melanogaster RE17910 full insert cDNA.
> ACCESSION   AY119119
> --------------------------------------------------- 
> Even if I use AY119119 with entret/seqret, the former entry is output.
> I think this is dangerous because it is difficult to notice that an incorrect entry has been output.
> In this case, I would like entret/seqret to output the latter entry, or to output no entry and give a warning.

This is a known problem. Only one AY071141 entry can be indexed, and EMBOSS 
has no control over which of the two (or more) entries will be found first 
in the sorted index. We save all the information from the duplicates (the 
accession numbers, for example) simply because we do not know which one to 
discard. So your search results are 'correct' as far as the index is concerned.

The root cause in EMBOSS is that the EMBLCD/Staden index format EMBOSS uses 
requires each ID to identify exactly one entry, so two entries cannot share 
an ID. The solution is likely to be a new EMBOSS index format.
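
As a rough illustration of the behaviour described above, here is a toy model 
in Python. It is not the EMBLCD/Staden on-disk format and not EMBOSS code. It 
assumes, for illustration only, that an accession is resolved to an ID and the 
ID is then looked up in a sorted table that returns the first match. The 
function names are hypothetical. The toy breaks ties by file position, whereas 
in the real index the order among duplicates is not under our control.

```python
# Toy model only: NOT the real EMBLCD/Staden index layout.
from bisect import bisect_left


def build_toy_index(entries):
    """entries: list of (entry_id, accession, description) tuples."""
    # ID table sorted by ID; duplicates sit next to each other.
    id_table = sorted(
        (entry_id, pos) for pos, (entry_id, _, _) in enumerate(entries)
    )
    # Assumption: every accession is kept and points at its entry's ID.
    acc_table = {acc: entry_id for entry_id, acc, _ in entries}
    return id_table, acc_table


def lookup_id(id_table, entry_id):
    """Return the position of the first entry with this ID, or None."""
    keys = [key for key, _ in id_table]
    i = bisect_left(keys, entry_id)
    if i < len(keys) and keys[i] == entry_id:
        return id_table[i][1]
    return None


def lookup_accession(id_table, acc_table, accession):
    """Resolve accession -> ID -> first entry with that ID."""
    entry_id = acc_table.get(accession)
    if entry_id is None:
        return None
    return lookup_id(id_table, entry_id)


entries = [
    ("AY071141", "AY071141", "2622 bp, full length cDNA"),
    ("AY071141", "AY119119", "2958 bp, full insert cDNA"),
]
id_table, acc_table = build_toy_index(entries)
pos = lookup_accession(id_table, acc_table, "AY119119")
print(entries[pos][2])  # prints "2622 bp, full length cDNA"
```

In this model the accession AY119119 is found, but it leads back to the shared 
ID, so the 2622 bp entry is returned instead of the 2958 bp one.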

The root cause in real life is databases that have duplicate IDs.
It is surprising that this can happen in GenBank.
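
One way to spot such entries before indexing is to scan the flat file for 
repeated LOCUS names. The following is a minimal sketch, assuming the usual 
GenBank flat-file layout in which the entry name is the second field of the 
LOCUS line and the primary accession is the second field of the ACCESSION 
line. The function name is hypothetical and this is not part of EMBOSS.

```python
from collections import defaultdict


def find_duplicate_locus_ids(lines):
    """Return {locus_id: [accession, ...]} for LOCUS names seen more than once."""
    seen = defaultdict(list)
    current = None
    for line in lines:
        if line.startswith("LOCUS"):
            fields = line.split()
            current = fields[1] if len(fields) > 1 else None
            if current is not None:
                # Placeholder until this entry's ACCESSION line is read.
                seen[current].append(None)
        elif line.startswith("ACCESSION") and current is not None:
            fields = line.split()
            # Record only the first (primary) accession of the entry.
            if len(fields) > 1 and seen[current][-1] is None:
                seen[current][-1] = fields[1]
    return {name: accs for name, accs in seen.items() if len(accs) > 1}


sample = [
    "LOCUS       AY071141                2622 bp",
    "DEFINITION  Drosophila melanogaster RE17910 full length cDNA.",
    "ACCESSION   AY071141",
    "LOCUS       AY071141                2958 bp",
    "DEFINITION  Drosophila melanogaster RE17910 full insert cDNA.",
    "ACCESSION   AY119119",
]
print(find_duplicate_locus_ids(sample))
# prints {'AY071141': ['AY071141', 'AY119119']}
```

For a real file, the same function can be given an open file handle instead of 
the sample list.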

regards,

Peter

-- 
------------------------------------------------
Peter Rice, LION Bioscience Ltd, Cambridge, UK
peter.rice at uk.lionbioscience.com +44 1223 224723



