[EMBOSS] Problems with GenBank indexing
Peter Rice
pmr at ebi.ac.uk
Mon Apr 10 10:44:47 UTC 2006
Natalia Jimenez Lozano wrote:
> I was looking for an explanation for this behaviour and I've found that
> the skipped IDs correspond to CDS from genomic sequences and have this format:
>
> >gi|10121909|gb|AAG13419.1|AC000348_16 T7N9.24 [Arabidopsis thaliana]
> MELPDVPVWRRVIVSAFFEALTFNIDIEEERSEIMMKTGAVVSNPRSRVKWDAFLSFQRDTSHNFTDRLY...
> >gi|8778864|gb|AAF79863.1|AC000348_16 T7N9.28 [Arabidopsis thaliana]
> MSVVLQITKDWVQALLGFLLLSFANISTRTNHKHFPHGSCSSIMAGFWIYMYIYSYLFITLKIIDLTS...
As Jon says, dbxfasta is a solution, but only a partial one. The real problem
is that these FASTA format sequences do indeed have duplicate IDs: both
entries above carry AC000348_16 in the ID position.
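To check how widespread the duplicates are before indexing, something like the following could be used. This is only a minimal sketch in Python, not part of EMBOSS: the function name find_duplicate_ids is hypothetical, and it assumes NCBI-style headers of the form gi|number|db|accession|id, where the fifth field is the one that ends up as the ID.

```python
from collections import defaultdict


def find_duplicate_ids(lines):
    """Map each repeated fifth-field ID to the accessions that share it.

    Assumes headers shaped like '>gi|number|db|accession|id description'.
    Lines that are not headers, or headers in another shape, are skipped.
    """
    seen = defaultdict(list)
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence data, not a header
        parts = line[1:].split()
        if not parts:
            continue  # bare '>' with nothing after it
        fields = parts[0].split("|")
        if len(fields) >= 5 and fields[0] == "gi":
            seen[fields[4]].append(fields[3])
    # Keep only IDs that occur on more than one entry
    return {key: accs for key, accs in seen.items() if len(accs) > 1}
```

Passing an open file handle for the FASTA file as lines should work, since the function only iterates over it line by line.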
This is protein sequence data, so it is not GenBank - was this GenPept or some
other database?
GenPept and other databases have been known to report "gb" or "emb" as the
database for protein sequences!!!
A possible solution is to add a new ID format to dbifasta and dbxfasta that
uses AAG13419 and AAF79863 as the ID and ignores the AC000348_16 part.
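To illustrate what such an ID format would extract, here is a rough sketch in Python. It is not how dbifasta or dbxfasta parse headers; the function name accession_id is hypothetical, and it assumes the gi|number|db|accession.version|id layout seen in the quoted entries, with the version suffix dropped.

```python
def accession_id(header):
    """Return the unversioned accession from an NCBI-style FASTA header.

    Assumes a header shaped like '>gi|number|db|accession.version|id ...'.
    The trailing id field (e.g. AC000348_16) is ignored.
    """
    parts = header.lstrip(">").split()
    if not parts:
        raise ValueError("empty FASTA header")
    fields = parts[0].split("|")
    if len(fields) < 4 or fields[0] != "gi":
        raise ValueError("unexpected header layout: " + header)
    # Drop the '.version' suffix so AAG13419.1 becomes AAG13419
    return fields[3].split(".")[0]
```

Because the accession is unique per protein record, the two entries above would then index under different IDs.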
Hope this helps,
Peter