[EMBOSS] Long Refseq accessions and dbifasta/seqret bug?? SOLVED.. workaround

Wed Aug 24 05:22:47 UTC 2005

OK, it works if you use dbxfasta instead of dbifasta and define a "resource"
with sufficiently long id and acc fields. For what I mean by "resource", see:
  % tfm dbxfasta
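
To decide how long the id and acc fields need to be, it may help to measure
the longest accession in the FASTA headers first. This is only a rough sketch
for a Bourne-style shell (sh/bash). It assumes NCBI-style headers of the form
>gi|number|ref|accession.version| as in human.protein.faa, and the function
name longest_acc is made up for illustration; it is not part of EMBOSS.

```shell
# longest_acc: hypothetical helper, not an EMBOSS program.
# Prints the length of the longest accession.version token, taken from the
# fourth '|'-separated field of each FASTA header line in the given files.
longest_acc() {
    awk -F'|' '
        /^>/ { n = length($4); if (n > max) max = n }
        END  { print max + 0 }
    ' "$@"
}
```

Something like `longest_acc ~/genprot/*.faa` would then report the longest
token. Files whose headers use a different layout (a hand-made diff.faa, say)
would need the field number adjusted.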

On 8/23/05, Rob Pollock <rob.pollock at gmail.com> wrote:
> Hi,
> 
> I download human.protein.faa from NCBI on a weekly basis and use dbifasta to
> build an EMBOSS database.  However, I have recently noticed that seqret has
> problems finding sequences with certain accessions, even though they are in
> the database.
> 
> I suspect this has something to do with the new long RefSeq accessions and
> the way the index is built.  Here is an example (I call my RefSeq protein
> database 'genprot' for want of a better name):
> 
> I am using EMBOSS 3.0.0, by the way (version 2.10.0 did the same thing).
> 
> This fails:
> 
>   % seqret 'genprot:NP_001015.1'
>   Reads and writes (returns) sequences
>   Error: Unable to read sequence 'genprot:NP_001015.1'
>   Died: seqret terminated: Bad value for '-sequence' and no prompt
> 
> However this works:
>   % seqret 'genprot:NP_001015'
>   Reads and writes (returns) sequences
>   Output sequence [np_001015.fasta]:
> 
>   % more np_001015.fasta
>   >NP_001015.1 NP_001015.1 ribosomal protein S21 [Homo sapiens]
>   MQNDAGEFVDLYVPRKCSASNRIIGAKDHASIQMNVAEVDKVTGRFNGQFKTYAICGAIR
>   RMGESDDSILRLAKADGIVSKNF
> 
> If I search human.protein.faa for strings that match NP_001015 I get a
> whole list:
> 
> >gi|62632744|ref|NP_001015050.1| hypothetical protein LOC200810 [Homo sapiens]
> >gi|62821803|ref|NP_001015884.1| RPB11b2alpha protein [Homo sapiens]
> >gi|62865862|ref|NP_001015508.1| purine-rich element binding protein G isoform B [Homo sapiens]
> >gi|62865641|ref|NP_001015879.1| aurora kinase
> ... etc
> 
> Again, using the full accession for any one of these fails:
> 
>   % seqret 'genprot:NP_001015071.1'
>   Reads and writes (returns) sequences
>   Error: Unable to read sequence 'genprot:NP_001015071.1'
>   Died: seqret terminated: Bad value for '-sequence' and no prompt
> 
> But, again, the truncated accession works:
>   % seqret 'genprot:NP_001015071'
>   Reads and writes (returns) sequences
>   Output sequence [np_001015071.fasta]:
> 
> Other sequences searched by full accession work fine (as long as they're
> _in_ the database!)
> 
>   % seqret 'genprot:NP_000544.1'
>   Reads and writes (returns) sequences
>   Output sequence [np_000544.fasta]:
> 
> I think the long accessions are causing dbifasta/seqret to spit the
> dummy somehow.
> 
> Note: This is how I format my database:
> 
>   % dbifasta -dbname genprot -idformat ncbi -directory ~/genprot -filenames \*.faa -auto
> 
> Maybe there is a workaround at this point?? Any suggestions??
> 
> [I have a file, diff.faa, that includes sequences I need that aren't found
> in the current release of RefSeq, hence the -filenames \*.faa bit.]
> 
> Thanks in advance
> Rob Pollock.
>