GenBank indexing Trouble
Peter Rice
peter.rice at uk.lionbioscience.com
Wed Sep 11 10:29:41 UTC 2002
Hironori Kawai wrote:
> Thanks for uploading the new version so quickly.
> The problem I reported does not occur in the new version.
Thanks.
> But, I would like to discuss another issue.
> In my previous report, I mentioned duplicate
> ID 'AY071141'. The duplicate entries are shown below.
> --------------------------------------------------
> LOCUS AY071141 2622 bp
> DEFINITION Drosophila melanogaster RE17910 full length cDNA.
> ACCESSION AY071141
>
> LOCUS AY071141 2958 bp
> DEFINITION Drosophila melanogaster RE17910 full insert cDNA.
> ACCESSION AY119119
> ---------------------------------------------------
> Even if I use AY119119 with entret/seqret, the former entry is output.
> I think this is dangerous, because it is difficult to notice that an
> incorrect entry has been output.
> In this case, I would prefer entret/seqret to output the latter entry,
> or to output no entry and give a warning.
This is a known problem. Only one AY071141 entry can be indexed, but EMBOSS
has no control over which of the two (or more) entries will be found
first in the sorted index. We save all the information from the duplicates
(the accession, for example) simply because we do not know which one to
discard. So your search results are 'correct' in the sense that they match
what is in the index.
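
As a rough illustration of the behaviour described above, here is a toy
model in Python. It is not the real EMBLCD file layout: the tuple layout,
the `acc_index` dictionary and the function names are all assumptions made
for this sketch. It only shows how an accession that resolves to a
duplicated ID can return the "wrong" entry.

```python
import bisect

# Toy model only: NOT the real EMBLCD/Staden file layout.
# Each entry is (id, accession, length_bp), taken from the report above.
entries = [
    ("AY071141", "AY071141", 2622),
    ("AY071141", "AY119119", 2958),
]

# Primary index: sorted (id, entry_number) pairs, searched by ID.
id_index = sorted((entry[0], number) for number, entry in enumerate(entries))
id_keys = [key for key, _ in id_index]

# Secondary index (hypothetical): accession -> ID. Both accessions are
# kept, because nothing tells the indexer which duplicate to discard.
acc_index = {entry[1]: entry[0] for entry in entries}


def fetch_by_id(entry_id):
    """Return the first entry found for entry_id, or None if absent."""
    pos = bisect.bisect_left(id_keys, entry_id)
    if pos == len(id_keys) or id_keys[pos] != entry_id:
        return None
    return entries[id_index[pos][1]]


def fetch_by_accession(accession):
    """Resolve the accession to an ID, then look that ID up."""
    entry_id = acc_index.get(accession)
    return None if entry_id is None else fetch_by_id(entry_id)


print(fetch_by_accession("AY119119"))
```

In this sketch both accessions lead to the same ID, so the lookup for
AY119119 returns the 2622 bp entry, which is the behaviour reported above.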
The root cause in EMBOSS is that the EMBLCD/Staden index format it uses
requires entry IDs to be unique: each ID can refer to only one entry. The
solution is likely to be a new EMBOSS index format.
The root cause in real life is databases that have duplicate IDs.
It is surprising that this can happen in GenBank.
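
Until the index format changes, one possible workaround is to check a flat
file for duplicate IDs before indexing it. The following Python sketch is
not part of EMBOSS; the function name is made up, and it assumes each entry
starts with a LOCUS line in the first column and has an ACCESSION line, as
in the excerpt quoted above.

```python
from collections import defaultdict


def find_duplicate_locus_ids(lines):
    """Map each LOCUS name seen more than once to its entries' accessions.

    An entry with no ACCESSION line is recorded as None.
    """
    seen = defaultdict(list)
    current = None
    for line in lines:
        fields = line.split()
        if line.startswith("LOCUS") and len(fields) > 1:
            # New entry: remember its ID and reserve a slot for the accession.
            current = fields[1]
            seen[current].append(None)
        elif line.startswith("ACCESSION") and len(fields) > 1 and current:
            # Keep only the first (primary) accession of the current entry.
            if seen[current][-1] is None:
                seen[current][-1] = fields[1]
    return {locus: accs for locus, accs in seen.items() if len(accs) > 1}


sample = """\
LOCUS AY071141 2622 bp
DEFINITION Drosophila melanogaster RE17910 full length cDNA.
ACCESSION AY071141

LOCUS AY071141 2958 bp
DEFINITION Drosophila melanogaster RE17910 full insert cDNA.
ACCESSION AY119119
""".splitlines()

for locus, accessions in find_duplicate_locus_ids(sample).items():
    print(locus, accessions)
```

To scan a real file, pass an open file handle instead of `sample`; the
function reads it line by line, so large releases do not need to fit in
memory.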
regards,
Peter
--
------------------------------------------------
Peter Rice, LION Bioscience Ltd, Cambridge, UK
peter.rice at uk.lionbioscience.com +44 1223 224723