dbifasta/seqret and ncbi-format fasta headers
john walshaw (JIC)
john.walshaw at bbsrc.ac.uk
Wed Apr 10 10:44:13 UTC 2002
I have a question about ncbi-type sequence headers in fasta-format files.
I'm using EMBOSS 2.3.1.
The ncbi format for the dbifasta program is described variously as:
ncbi : >blah|...[|ACC]|ID
and
>...[|accno]|id ...
in the EMBOSS admin guide and by 'tfm dbifasta'.
From these I assumed that within the first of the whitespace-delimited
'fields', the last two '|'-delimited subfields will be treated by dbifasta
as the accession no and ID respectively:
>gi|15375403|dbj|AB039926.1|AB039926 Arabidopsis ...blah...
                 ^^^^^^^^^^ ^^^^^^^^
                 accno      id
- but this doesn't work: in this case seqret reports that AB039926 is not in
my database (which I indexed with dbifasta using idformat 'ncbi', and
specified with method: emblcd, format: fasta and the necessary dir: and
indexdir: fields).
But this sequence works (I can get it with seqret) -
>gi|15383574|gb|AV540904.2|AV540904 AV540904 Arabidopsis thaliana roots ...blah
                           ^^^^^^^^ ^^^^^^^^
- because the second whitespace-delimited field is present AND identical to
the previous subfield. The 2nd field is not simply being used as the accno,
because, for example, this entry:
>gi|15383574|gb|AV540904.2|XXXXXXX YYYYYYY
cannot be returned by seqret either as XXXXXXX or YYYYYYY (or by any means
other than requesting all sequences in the DB).
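
The interpretation assumed above can be sketched in Python. This is only a reading of the documented pattern '>...[|accno]|id', not dbifasta's actual parsing code, and the function name parse_ncbi_header is hypothetical:

```python
def parse_ncbi_header(line):
    """Split a FASTA header as the pattern '>...[|accno]|id' suggests.

    Assumption (not confirmed against the EMBOSS source): only the first
    whitespace-delimited field is examined, and its last two '|'-delimited
    subfields are taken as the accession number and the ID respectively.
    """
    if not line.startswith(">"):
        raise ValueError("not a FASTA header line")
    fields = line[1:].split(None, 1)      # first field, then the description
    if not fields:
        raise ValueError("empty FASTA header")
    subfields = fields[0].split("|")
    seq_id = subfields[-1]
    # The accession number is optional in the documented pattern.
    accno = subfields[-2] if len(subfields) >= 2 else ""
    return accno, seq_id


headers = [
    ">gi|15375403|dbj|AB039926.1|AB039926 Arabidopsis ...blah...",
    ">gi|15383574|gb|AV540904.2|AV540904 AV540904 Arabidopsis thaliana roots",
    ">gi|15383574|gb|AV540904.2|XXXXXXX YYYYYYY",
]
for header in headers:
    print(parse_ncbi_header(header))
```

Under this reading all three headers would yield a usable accno and id (for the first one, 'AB039926.1' and 'AB039926'), which is not what seqret and dbifasta are observed to do.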
Am I doing something stupid? I've looked into this problem a lot, and can
provide debug files for seqret & dbifasta, and I'm sure my db specification
in emboss.default is correct. For the sequences which fail, seqret reads the
correct header line, but then thinks that accno=''. And seqret always
returns the id as 'gi' (even for sequences which can be fetched normally).
All of the correct accnos (e.g. AV540904.2) appear in the acnum.trg file.
Regards,
John Walshaw
John Innes Centre, Norwich Research Park,
Colney, Norwich NR4 7UH, UK. +44(0)1603 450827