[EMBOSS] seqret output sequence format "ncbi"
john walshaw (JIC)
john.walshaw at bbsrc.ac.uk
Mon Aug 18 17:37:50 UTC 2008
Hello,
I'm having trouble getting seqret to return the expected FASTA header
style with the 'ncbi' output sequence format, whether I apply it to the
native UniProt data files or to an EMBOSS database made from them.
In the manual for seqret, in the section "Output Format...", this is the
description of the "ncbi" style of FASTA format:
ncbi multiple NCBI style FASTA format with the database name, entry
name and accession number separated by pipe ("|") characters.
When I apply seqret to the current native-format UniProt files (Jul
22nd, UniProt release 14.0) with these arguments, I get the following
FASTA-format headers:
seqret uniprot_sprot.dat -outseq stdout -osf ncbi | grep '^>' | head -5
>gnl|unk|104K_THEAN (Q4U9M9) RecName: Full=104 kDa microneme/rhoptry antigen; AltName: Full=p104; Flags: Precursor;
>gnl|unk|104K_THEPA (P15711) RecName: Full=104 kDa microneme/rhoptry antigen; AltName: Full=p104; Flags: Precursor;
>gnl|unk|108_SOLLC (Q43495) RecName: Full=Protein 108; Flags: Precursor;
>gnl|unk|10KD_VIGUN (P18646) RecName: Full=10 kDa protein; AltName: Full=Clone PSAS10; Flags: Precursor;
>gnl|unk|110KD_PLAKN (P13813) RecName: Full=110 kDa antigen; AltName: Full=PK110; Flags: Fragment;
- shouldn't this be something like this instead:
>unk|104K_THEAN|Q4U9M9 .....
>unk|104K_THEPA|P15711 ....
etc?
seqret seems to be identifying the ID and AccNo separately ok, because
if I specify 'fasta' or 'pearson' as the output format I get the
expected headers, i.e.:
>104K_THEAN Q4U9M9 RecName: Full=104 kDa microneme/rhoptry antigen;
AltName: Full=p104; Flags: Precursor;
etc.
I've found this behaviour in EMBOSS 5.0.0 and 6.0.1. If I apply seqret
to an EMBOSS database I've made by running dbxflat on the native UniProt
files, then -osf ncbi gives me the same format as when applied directly
to the files, except that the database name appears instead of 'unk'.
By the way, is there a way of making seqret return the same style header
as WU-BLAST sp2fasta, i.e. >db|accno|id .... (instead of
>db|id|accno), or is this what the ncbi format is intended to do?
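In case it helps anyone with the same need, the reordering can also be done
outside seqret by post-processing the headers. Below is a minimal sketch in
Python, assuming the headers have exactly the form shown above
(>gnl|db|id (accno) description). The names HEADER_RE and rewrite_header and
the order values "id_acc" and "acc_id" are made up for this illustration and
are not part of EMBOSS or WU-BLAST.

```python
import re

# Matches headers of the form seen in the seqret output above:
#   >gnl|DB|ID (ACC) description
# Assumption: the accession is always in parentheses right after the ID.
HEADER_RE = re.compile(r"^>gnl\|([^|\s]+)\|(\S+) \(([^)\s]+)\)(.*)$")


def rewrite_header(line, order="id_acc"):
    """Return the line with its header fields reordered.

    order="id_acc" gives >db|id|accno, order="acc_id" gives >db|accno|id.
    Lines that do not match the assumed pattern (sequence lines, other
    header styles) are returned unchanged, minus the trailing newline.
    """
    line = line.rstrip("\n")
    match = HEADER_RE.match(line)
    if match is None:
        return line
    db, entry_id, acc, rest = match.groups()
    if order == "acc_id":
        fields = (db, acc, entry_id)
    else:
        fields = (db, entry_id, acc)
    # "rest" keeps its leading space, so the description follows as before.
    return ">" + "|".join(fields) + rest


# Example with one of the headers from this message.
example = ">gnl|unk|108_SOLLC (Q43495) RecName: Full=Protein 108; Flags: Precursor;"
print(rewrite_header(example, "id_acc"))
print(rewrite_header(example, "acc_id"))
```

To use this as a filter, one could call rewrite_header on each line of the
seqret output (for instance while iterating over sys.stdin) and print the
result. Because non-matching lines pass through untouched, the sequence data
is not altered.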
best wishes,
Dr John Walshaw
Department of Computational & Systems Biology
John Innes Centre
Colney
Norwich NR4 7UH
UK