[EMBOSS] Problems with GenBank indexing

Natalia Jimenez Lozano natalia.jimenez at pcm.uam.es
Fri Apr 7 12:50:16 UTC 2006


Dear Jon,

> Dear Natalia
>
> By default, dbifasta will index the ID name and the accession number (if present).
>
> To index the Sequence Version, GI number and words in the description, you must
> run dbifasta with the '-fields' qualifier, e.g. "-fields acc", "-fields sv acc"
> etc.   If you don't, you will not be able to retrieve by those fields. Please
> see http://emboss.sourceforge.net/apps/cvs/dbifasta.html.
>   
Yes, the indexing was run with the -fields qualifier :-(
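
For reference, here is a minimal Python sketch of how the identifiers in these header lines break down. It is not EMBOSS code: the function name parse_ncbi_header and the field names are made up for illustration, and which token dbifasta treats as ID, accession or sequence version is an assumption here, not something taken from the documentation.

```python
from typing import NamedTuple, Optional


class NcbiHeader(NamedTuple):
    gi: str                  # first identifier, e.g. 10121909
    db: str                  # source database tag, e.g. gb or emb
    accession_version: str   # second identifier, e.g. AAG13419.1
    locus: Optional[str]     # last identifier, e.g. AC000348_16 (may be absent)
    description: str


def parse_ncbi_header(line: str) -> NcbiHeader:
    """Split a '>gi|N|db|ACC.V|LOCUS description' line into its parts."""
    # Drop the leading '>' (and any stray spaces), then separate the
    # identifier token from the free-text description.
    token, _, description = line.lstrip("> ").partition(" ")
    parts = token.split("|")
    if len(parts) < 4 or parts[0] != "gi":
        raise ValueError(f"not a gi-style NCBI header: {line!r}")
    # The fifth field is empty for entries such as 'gi|64029|emb|CAA23986.1|'.
    locus = parts[4] if len(parts) > 4 and parts[4] else None
    return NcbiHeader(parts[1], parts[2], parts[3], locus, description)


if __name__ == "__main__":
    for header in (
        ">gi|10121909|gb|AAG13419.1|AC000348_16 T7N9.24 [Arabidopsis thaliana]",
        ">gi|64029|emb|CAA23986.1| reading frame [Lophius americanus]",
    ):
        print(parse_ncbi_header(header))
```

The first header carries a non-empty last field (AC000348_16); the second does not, which is the visible difference between the entries that fail and the ones that work.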
> dbifasta only retrieves the first of any duplicate entries.  So far as I'm aware
> dbxfasta can retrieve duplicate entries.
>   
We'll try with dbxfasta!
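
In the meantime, a rough way to see how many entries are affected is to scan the FASTA file for repeated last-field identifiers before indexing. The sketch below is only an illustration under that assumption; duplicate_ids is a hypothetical helper, not part of EMBOSS, and it only looks at gi-style header lines.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def duplicate_ids(fasta_lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each repeated last-field ID to the GI numbers that share it."""
    gis_by_id = defaultdict(list)
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence data, not a header
        fields = line[1:].split(None, 1)
        if not fields:
            continue  # bare '>' with nothing after it
        parts = fields[0].split("|")
        # Expect gi|<number>|<db>|<accession.version>|<locus>
        if len(parts) >= 5 and parts[0] == "gi" and parts[4]:
            gis_by_id[parts[4]].append(parts[1])
    # Keep only the IDs that occur more than once.
    return {k: v for k, v in gis_by_id.items() if len(v) > 1}


if __name__ == "__main__":
    sample = [
        ">gi|10121909|gb|AAG13419.1|AC000348_16 T7N9.24 [Arabidopsis thaliana]",
        ">gi|8778864|gb|AAF79863.1|AC000348_16 T7N9.28 [Arabidopsis thaliana]",
        ">gi|64029|emb|CAA23986.1| reading frame [Lophius americanus]",
    ]
    print(duplicate_ids(sample))
    # prints {'AC000348_16': ['10121909', '8778864']}
```

For a real file, the same function can be fed an open file handle, since it only iterates over lines.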
> Does that help?  Feel free to get back in touch.
>   
Yes, a lot.
Thank you very much
Regards,
Natalia
> Cheers
>
> Jon
>
>> Hi everybody,
>>
>> I was trying to retrieve FASTA protein sequences from GenBank by ID
>> using seqret, but this did not work for every ID. Retrieval by GI,
>> however, does work.
>>
>> Additionally, during the indexing process dbifasta reported some
>> warnings like this one:
>>
>> Warning: Duplicate ID skipped: 'AC000348_16' All hits will point to
>> first ID found
>>
>> I looked for an explanation of this behaviour and found that the
>> skipped IDs belong to CDS from genomic sequences and have this format:
>>
>>  >gi|10121909|gb|AAG13419.1|AC000348_16 T7N9.24 [Arabidopsis thaliana]
>> MELPDVPVWRRVIVSAFFEALTFNIDIEEERSEIMMKTGAVVSNPRSRVKWDAFLSFQRDTSHNFTDRLY...
>>  >gi|8778864|gb|AAF79863.1|AC000348_16 T7N9.28 [Arabidopsis thaliana]
>> MSVVLQITKDWVQALLGFLLLSFANISTRTNHKHFPHGSCSSIMAGFWIYMYIYSYLFITLKIIDLTS...
>>
>> For the entries above, retrieval by the first identifier (the GI)
>> works for both of them. Retrieval by the last identifier
>> (AC000348_16) returns only the first entry. Retrieval by the second
>> identifier (AAG13419.1 or AAF79863.1) does not work at all.
>>
>> However, sequences with the following format are indexed correctly:
>>
>>  >gi|64029|emb|CAA23986.1| reading frame [Lophius americanus]
>> MKMVSSSRLRCLLVLLLSLTASISCSFAGQRDSKLRLLLHRYPLQGSKQDMTRSALAELLLSDLLQGENE ...
>>
>> and these sequences can be retrieved by both the first and second
>> identifiers (64029 and CAA23986.1).
>>
>> Does anybody know how to solve these problems?
>> Thanks in advance,
>> Natalia
>> _______________________________________________
>> EMBOSS mailing list
>> EMBOSS at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/emboss
>>
>>     
>



