[EMBOSS] Long Refseq accessions and dbifasta/seqret bug??
Rob Pollock
rob.pollock at gmail.com
Tue Aug 23 03:34:07 UTC 2005
Hi,
I download human.protein.faa from NCBI on a weekly basis and use dbifasta to
build an EMBOSS database. However, I have recently noticed that seqret has
problems finding sequences with certain accessions, even though they are in
the database. I suspect this has something to do with the new long RefSeq
accessions and the way the index is built. Here is an example (I call my
RefSeq protein database 'genprot' for want of a better name).
I am using EMBOSS 3.0.0, btw (version 2.10.0 did the same thing).
This fails:
% seqret 'genprot:NP_001015.1'
Reads and writes (returns) sequences
Error: Unable to read sequence 'genprot:NP_001015.1'
Died: seqret terminated: Bad value for '-sequence' and no prompt
However this works:
% seqret 'genprot:NP_001015'
Reads and writes (returns) sequences
Output sequence [np_001015.fasta]:
% more np_001015.fasta
>NP_001015.1 NP_001015.1 ribosomal protein S21 [Homo sapiens]
MQNDAGEFVDLYVPRKCSASNRIIGAKDHASIQMNVAEVDKVTGRFNGQFKTYAICGAIR
RMGESDDSILRLAKADGIVSKNF
If I search human.protein.faa for strings matching NP_001015 I get a
whole list:
>gi|62632744|ref|NP_001015050.1| hypothetical protein LOC200810 [Homo sapiens]
>gi|62821803|ref|NP_001015884.1| RPB11b2alpha protein [Homo sapiens]
>gi|62865862|ref|NP_001015508.1| purine-rich element binding protein G isoform B [Homo sapiens]
>gi|62865641|ref|NP_001015879.1| aurora kinase
... etc
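The search above can be reproduced with a short script. This is only a minimal sketch: the function name accessions_with_prefix and the regular expression are assumptions made for illustration, and it relies on headers following the gi|...|ref|ACCESSION.VERSION| layout shown in the list. It prints every accession that begins with a given prefix, which shows how many long accessions share their leading characters with the short accession NP_001015.

```python
import re

# Matches the accession and version in NCBI-style headers such as
# >gi|62632744|ref|NP_001015050.1| ...  (layout assumed from the list above)
REF_FIELD = re.compile(r"\|ref\|([A-Z]{2}_\d+)\.(\d+)\|")


def accessions_with_prefix(header_lines, prefix):
    """Return (accession, version) pairs whose accession starts with prefix."""
    hits = []
    for line in header_lines:
        if not line.startswith(">"):
            continue  # skip sequence lines, keep only FASTA headers
        match = REF_FIELD.search(line)
        if match and match.group(1).startswith(prefix):
            hits.append((match.group(1), int(match.group(2))))
    return hits


# Header lines copied from the listing above.
headers = [
    ">gi|62632744|ref|NP_001015050.1| hypothetical protein LOC200810 [Homo sapiens]",
    ">gi|62821803|ref|NP_001015884.1| RPB11b2alpha protein [Homo sapiens]",
    ">gi|62865862|ref|NP_001015508.1| purine-rich element binding protein G isoform B [Homo sapiens]",
]

for accession, version in accessions_with_prefix(headers, "NP_001015"):
    print(f"{accession}.{version}")
```

To scan the real file, pass an open file handle for human.protein.faa instead of the headers list, since the function only needs an iterable of lines.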
Again, using the full accession for any one of these fails:
% seqret 'genprot:NP_001015071.1'
Reads and writes (returns) sequences
Error: Unable to read sequence 'genprot:NP_001015071.1'
Died: seqret terminated: Bad value for '-sequence' and no prompt
But, again, the truncated accession works:
% seqret 'genprot:NP_001015071'
Reads and writes (returns) sequences
Output sequence [np_001015071.fasta]:
Other sequences searched by full accession work fine (as long as they're _in_
the database!)
% seqret 'genprot:NP_000544.1'
Reads and writes (returns) sequences
Output sequence [np_000544.fasta]:
I think the long accessions are causing dbifasta/seqret to spit the
dummy somehow.
Note: this is how I format my database:
% dbifasta -dbname genprot -idformat ncbi -directory ~/genprot -filenames \*.faa -auto
Maybe there is a workaround at this point?? Any suggestions??
[I have a file, diff.faa, that includes sequences I need that aren't found in
the current release of RefSeq, hence the -filenames \*.faa bit.]
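As a stopgap, the lookups that succeed above all use the accession without its version suffix, so a script could strip the suffix before calling seqret. The helper below is a hypothetical sketch of that idea (the name unversioned_query is made up here); it is based only on the behaviour reported in this message, not on a confirmed fix, and it only builds the query string rather than running seqret.

```python
import re

# A trailing ".<digits>" is treated as the RefSeq version suffix.
VERSION_SUFFIX = re.compile(r"\.\d+$")


def unversioned_query(dbname, accession):
    """Build a 'database:accession' query string with any version removed."""
    return f"{dbname}:{VERSION_SUFFIX.sub('', accession)}"


print(unversioned_query("genprot", "NP_001015071.1"))  # genprot:NP_001015071
```

Accessions that already lack a version are passed through unchanged, so the helper can be applied to every query.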
Thanks in advance
Rob Pollock.