[EMBOSS] Long Refseq accessions and dbifasta/seqret bug?? SOLVED.. workaround
Rob Pollock
rob.pollock at gmail.com
Wed Aug 24 05:22:47 UTC 2005
OK, it works if you use dbxfasta instead of dbifasta and define a "resource"
with sufficiently long id and acc fields. See
% tfm dbxfasta
for what I mean by a resource.
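
To pick field lengths that are long enough, it helps to know how long the accessions in the FASTA file actually are. Below is a minimal sketch in Python, not part of EMBOSS: the function name longest_ncbi_fields is made up for illustration, and it assumes every header follows the gi|number|ref|accession.version| layout seen in human.protein.faa. Headers with a different layout (possibly the extra diff.faa file) are simply skipped.

```python
def longest_ncbi_fields(lines):
    """Return (longest accession.version, longest bare accession) lengths.

    Assumes NCBI-style headers: >gi|<number>|ref|<accession.version>| text
    Headers that do not have at least four '|'-separated fields are ignored.
    """
    max_full = 0
    max_acc = 0
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence data, not a header
        parts = line[1:].split("|")
        if len(parts) < 4:
            continue  # not the assumed NCBI layout
        full = parts[3].strip()          # e.g. NP_001015050.1
        acc = full.split(".", 1)[0]      # e.g. NP_001015050
        max_full = max(max_full, len(full))
        max_acc = max(max_acc, len(acc))
    return max_full, max_acc


# Two headers taken from the original message, plus a placeholder sequence line.
sample = [
    ">gi|62632744|ref|NP_001015050.1| hypothetical protein LOC200810 [Homo sapiens]",
    "ACDEFGHIK",
    ">gi|62821803|ref|NP_001015884.1| RPB11b2alpha protein [Homo sapiens]",
]
print(longest_ncbi_fields(sample))
```

On a real file you would pass an open file handle instead of the sample list, then choose id and acc lengths in the resource that are at least as large as the numbers reported.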
On 8/23/05, Rob Pollock <rob.pollock at gmail.com> wrote:
> Hi,
>
> I download human.protein.faa from NCBI on a weekly basis and use dbifasta to
> build an EMBOSS database. However, I have recently noticed that seqret has
> problems finding sequences with certain accessions, even though they are in
> the database.
>
> I suspect this is something to do with the new long RefSeq accessions and
> the way the index is built. Here is an example (I call my RefSeq protein
> database 'genprot' for want of a better name).
>
> I am using EMBOSS 3.0.0, btw (version 2.10.0 did the same thing).
>
> This fails:
>
> % seqret 'genprot:NP_001015.1'
> Reads and writes (returns) sequences
> Error: Unable to read sequence 'genprot:NP_001015.1'
> Died: seqret terminated: Bad value for '-sequence' and no prompt
>
> However this works:
> % seqret 'genprot:NP_001015'
> Reads and writes (returns) sequences
> Output sequence [np_001015.fasta]:
>
> % more np_001015.fasta
> >NP_001015.1 NP_001015.1 ribosomal protein S21 [Homo sapiens]
> MQNDAGEFVDLYVPRKCSASNRIIGAKDHASIQMNVAEVDKVTGRFNGQFKTYAICGAIR
> RMGESDDSILRLAKADGIVSKNF
>
> If I search human.protein.faa for strings matching NP_001015 I get a
> whole list:
>
> >gi|62632744|ref|NP_001015050.1| hypothetical protein LOC200810 [Homo sapiens]
> >gi|62821803|ref|NP_001015884.1| RPB11b2alpha protein [Homo sapiens]
> >gi|62865862|ref|NP_001015508.1| purine-rich element binding protein G isoform B [Homo sapiens]
> >gi|62865641|ref|NP_001015879.1| aurora kinase
> ... etc
>
> Again, using the full accession for any one of these fails:
>
> % seqret 'genprot:NP_001015071.1'
> Reads and writes (returns) sequences
> Error: Unable to read sequence 'genprot:NP_001015071.1'
> Died: seqret terminated: Bad value for '-sequence' and no prompt
>
> But, again, the truncated accession works:
> % seqret 'genprot:NP_001015071'
> Reads and writes (returns) sequences
> Output sequence [np_001015071.fasta]:
>
> Other sequences searched by full accession work fine (as long as they're
> _in_ the database!)
>
> % seqret 'genprot:NP_000544.1'
> Reads and writes (returns) sequences
> Output sequence [np_000544.fasta]:
>
> I think the long accessions are causing dbifasta/seqret to spit the
> dummy somehow.
>
> Note: this is how I build my database:
>
> % dbifasta -dbname genprot -idformat ncbi -directory ~/genprot -filenames \*.faa -auto
>
> Maybe there is a workaround at this point?? Any suggestions??
>
> [I also have a file, diff.faa, with sequences I need that aren't in the
> current RefSeq release, hence the -filenames \*.faa bit.]
>
> Thanks in advance
> Rob Pollock.
>