[EMBOSS] Long Refseq accessions and dbifasta/seqret bug??

Rob Pollock rob.pollock at gmail.com
Tue Aug 23 03:34:07 UTC 2005


Hi,

I download human.protein.faa from NCBI on a weekly basis and use dbifasta to
build an EMBOSS database.  However, I have recently noticed that seqret has
problems finding sequences with certain accessions, even though they are in
the database.

I suspect this is something to do with the new long RefSeq accessions and the
way the index is built.  I am using EMBOSS 3.0.0, btw (version 2.10.0 did the
same thing).  Here is an example (I call my RefSeq protein database 'genprot'
for want of a better name).

This fails:

  % seqret 'genprot:NP_001015.1'
  Reads and writes (returns) sequences
  Error: Unable to read sequence 'genprot:NP_001015.1'
  Died: seqret terminated: Bad value for '-sequence' and no prompt

However, this works:

  % seqret 'genprot:NP_001015'
  Reads and writes (returns) sequences
  Output sequence [np_001015.fasta]:

  % more np_001015.fasta
  >NP_001015.1 NP_001015.1 ribosomal protein S21 [Homo sapiens]
  MQNDAGEFVDLYVPRKCSASNRIIGAKDHASIQMNVAEVDKVTGRFNGQFKTYAICGAIR
  RMGESDDSILRLAKADGIVSKNF

If I search human.protein.faa for strings matching NP_001015 I get a
whole list:

>gi|62632744|ref|NP_001015050.1| hypothetical protein LOC200810 [Homo sapiens]
>gi|62821803|ref|NP_001015884.1| RPB11b2alpha protein [Homo sapiens]
>gi|62865862|ref|NP_001015508.1| purine-rich element binding protein G isoform B [Homo sapiens]
>gi|62865641|ref|NP_001015879.1| aurora kinase
... etc

Again, using the full accession for any one of these fails:

  % seqret 'genprot:NP_001015071.1'
  Reads and writes (returns) sequences
  Error: Unable to read sequence 'genprot:NP_001015071.1'
  Died: seqret terminated: Bad value for '-sequence' and no prompt

But, again, the truncated accession works:

  % seqret 'genprot:NP_001015071'
  Reads and writes (returns) sequences
  Output sequence [np_001015071.fasta]:

Other sequences searched by full accession work fine (as long as they're _in_
the database!):

  % seqret 'genprot:NP_000544.1'
  Reads and writes (returns) sequences
  Output sequence [np_000544.fasta]:
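
To see which records carry the longer accessions, a rough Python sketch along
these lines could be run over the .faa file.  It assumes every header follows
the gi|number|ref|ACCESSION.VERSION| layout shown in the listing above; the
function name and the script name in the usage comment are made up for
illustration and are not part of EMBOSS.

```python
import re
import sys

# Assumed header layout: >gi|<number>|ref|<accession>.<version>| description
HEADER_RE = re.compile(r"^>gi\|\d+\|ref\|([A-Z]{2}_\d+)\.(\d+)\|")


def matching_accessions(lines, prefix):
    """Return (accession, version, digit_count) for each header whose
    accession starts with the given prefix (hypothetical helper)."""
    hits = []
    for line in lines:
        m = HEADER_RE.match(line)
        if m and m.group(1).startswith(prefix):
            acc, ver = m.group(1), int(m.group(2))
            # Number of digits after the underscore, e.g. 6 for NP_000544
            digits = len(acc.split("_", 1)[1])
            hits.append((acc, ver, digits))
    return hits


if __name__ == "__main__":
    # Usage (hypothetical script name):
    #   python list_acc.py human.protein.faa NP_001015
    path, prefix = sys.argv[1], sys.argv[2]
    with open(path) as fh:
        for acc, ver, digits in matching_accessions(fh, prefix):
            print(f"{acc}.{ver}\t{digits} digits")
```

Comparing the digit counts of the IDs that fail with those that work would
show whether accession length really is the common factor.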

I think the long accessions are causing dbifasta/seqret to spit the
dummy somehow.

Note: this is how I format my database:

  % dbifasta -dbname genprot -idformat ncbi -directory ~/genprot -filenames \*.faa -auto

Maybe there is a workaround at this point?? Any suggestions??

[I have a file, diff.faa, that includes sequences I need that aren't found in
the current release of RefSeq, hence the -filenames \*.faa bit.]

Thanks in advance
Rob Pollock.



