[EMBOSS] Problems with GenBank indexing
Natalia Jimenez Lozano
natalia.jimenez at pcm.uam.es
Thu Apr 6 07:56:06 UTC 2006
Hi everybody,
I was trying to retrieve protein sequences in FASTA format from GenBank
by ID using seqret, but retrieval fails for some IDs. Retrieval by GI
number, however, works.
Additionally, during indexing (dbifasta) I got warnings like this one:
Warning: Duplicate ID skipped: 'AC000348_16' All hits will point to
first ID found
I looked for an explanation of this behaviour and found that the skipped
IDs correspond to CDS entries from genomic sequences, whose headers have
this format:
>gi|10121909|gb|AAG13419.1|AC000348_16 T7N9.24 [Arabidopsis thaliana]
MELPDVPVWRRVIVSAFFEALTFNIDIEEERSEIMMKTGAVVSNPRSRVKWDAFLSFQRDTSHNFTDRLY...
>gi|8778864|gb|AAF79863.1|AC000348_16 T7N9.28 [Arabidopsis thaliana]
MSVVLQITKDWVQALLGFLLLSFANISTRTNHKHFPHGSCSSIMAGFWIYMYIYSYLFITLKIIDLTS...
For the entries above, retrieval by the first identifier (the GI number)
works for both of them. Retrieval by the last identifier (AC000348_16)
returns only the first entry. Retrieval by the second identifier
(AAG13419.1 or AAF79863.1) is not possible at all.
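
The collision can be illustrated with a small Python sketch. It is not part of EMBOSS and does not reproduce how dbifasta parses headers; it only assumes the pipe-delimited layout visible in the two headers above (gi, GI number, database tag, accession, optional trailing field) and reports which trailing fields are shared by more than one entry. The function names and the name "locus" for the last field are hypothetical, chosen for illustration.

```python
from collections import defaultdict


def parse_ncbi_header(line):
    """Split an NCBI-style FASTA header into its pipe-delimited fields.

    Assumed layout (taken from the example headers, not from EMBOSS):
    gi | <number> | <db> | <accession> | <trailing field, may be empty>
    """
    # Keep only the identifier token, i.e. everything before the first space.
    token = line.lstrip(">").split(None, 1)[0]
    fields = token.split("|")
    if len(fields) < 4 or fields[0] != "gi":
        raise ValueError(f"unexpected header: {line!r}")
    return {
        "gi": fields[1],
        "db": fields[2],
        "accession": fields[3],
        # "locus" is a hypothetical name for the trailing field.
        "locus": fields[4] if len(fields) > 4 else "",
    }


def find_shared_loci(headers):
    """Return {trailing field: [accessions]} for fields used more than once."""
    by_locus = defaultdict(list)
    for header in headers:
        rec = parse_ncbi_header(header)
        if rec["locus"]:
            by_locus[rec["locus"]].append(rec["accession"])
    return {locus: accs for locus, accs in by_locus.items() if len(accs) > 1}


headers = [
    ">gi|10121909|gb|AAG13419.1|AC000348_16 T7N9.24 [Arabidopsis thaliana]",
    ">gi|8778864|gb|AAF79863.1|AC000348_16 T7N9.28 [Arabidopsis thaliana]",
]
print(find_shared_loci(headers))
# {'AC000348_16': ['AAG13419.1', 'AAF79863.1']}
```

Running something like this over the header lines of the whole FASTA file would list every trailing identifier that is shared by several entries, which should correspond to the IDs named in the "Duplicate ID skipped" warnings.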
However, sequences with the following format are indexed correctly:
>gi|64029|emb|CAA23986.1| reading frame [Lophius americanus]
MKMVSSSRLRCLLVLLLSLTASISCSFAGQRDSKLRLLLHRYPLQGSKQDMTRSALAELLLSDLLQGENE ...
and these sequences can be retrieved by both the first and the second
identifier (64029 and CAA23986.1).
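
One untested idea is to preprocess the FASTA file so that every header starts with the accession, which is unique in the examples above, and keeps the original identifier string after it. The Python sketch below does only that rewriting. Whether dbifasta would then index and retrieve by the accession depends on its ID format settings and is an assumption that would need checking; the function names are hypothetical.

```python
def accession_first(header):
    """Move the accession to the front of an NCBI-style FASTA header.

    Headers that do not look like 'gi|<number>|<db>|<accession>|...' are
    returned unchanged.
    """
    ident, _, desc = header.rstrip("\n").lstrip(">").partition(" ")
    fields = ident.split("|")
    if len(fields) < 4 or fields[0] != "gi":
        return header.rstrip("\n")
    accession = fields[3]
    # Keep the original identifier string so no information is lost.
    return f">{accession} {ident} {desc}".rstrip()


def rewrite_fasta(lines):
    """Yield FASTA lines with header lines rewritten, sequence lines as-is."""
    for line in lines:
        line = line.rstrip("\n")
        yield accession_first(line) if line.startswith(">") else line


example = [
    ">gi|10121909|gb|AAG13419.1|AC000348_16 T7N9.24 [Arabidopsis thaliana]",
    ">gi|64029|emb|CAA23986.1| reading frame [Lophius americanus]",
]
for out in rewrite_fasta(example):
    print(out)
# >AAG13419.1 gi|10121909|gb|AAG13419.1|AC000348_16 T7N9.24 [Arabidopsis thaliana]
# >CAA23986.1 gi|64029|emb|CAA23986.1| reading frame [Lophius americanus]
```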
Does anybody know how to solve these problems?
Thanks in advance,
Natalia