[EMBOSS] seqret output sequence format "ncbi"
Peter Rice
pmr at ebi.ac.uk
Tue Aug 19 08:29:46 UTC 2008
john walshaw (JIC) wrote:
> I'm having trouble getting seqret to return the expected FASTA-header
> style when using the 'ncbi' output sequence format, when applying it to
> either the native UniProt data files or an EMBOSS database made from
> them.
>
> In the manual for seqret, in the section "Output Format...", this is the
> description of the "ncbi" style of FASTA format:
>
> ncbi multiple NCBI style FASTA format with the database name, entry
> name and accession number separated by pipe ("|") characters.
This could be extended to explain that NCBI also have an annoyingly
short list of valid database names. Any other name has to appear as
"gnl|dbname", as you see for your uniprox database indexed with dbxflat.
We use "unk" if we have no known database name, but we treat it as a
general name - NCBI's "unk|identifier" is something special to them.
If you use one of the "NCBI list" database names, for example adding
"-sdbname sp" to the command line, you will get a swissprot NCBI standard
identifier - but this is because "sp" is one of their special names. You
cannot even assume the data is protein if you see "sp" in the identifier
(genpept for example uses emb and gb as database names for protein
sequences).
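
The naming rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the seqret implementation: the function name ncbi_style_id, the contents of NCBI_DB_TAGS and the order of the fields after the database name are all hypothetical choices made for the example.

```python
# Database tags that get special treatment in NCBI-style identifiers.
# ASSUMPTION: an illustrative set, not necessarily the list NCBI or EMBOSS uses.
NCBI_DB_TAGS = {"gb", "emb", "dbj", "pir", "prf", "sp", "pdb", "pat", "bbs", "ref", "lcl"}


def ncbi_style_id(dbname, accession, entry_name):
    """Build a pipe-separated identifier, falling back to gnl|dbname."""
    # With no database name at all, use "unk" and treat it as a general name.
    db = dbname or "unk"
    if db in NCBI_DB_TAGS:
        # ASSUMPTION: field order after the tag is chosen for illustration only.
        return f"{db}|{accession}|{entry_name}"
    # Any name outside the special list has to appear as gnl|dbname.
    return f"gnl|{db}|{entry_name}"


print(ncbi_style_id("sp", "Q4U9M9", "104K_THEAN"))
print(ncbi_style_id("uniprox", "Q4U9M9", "104K_THEAN"))
```

The second call shows why a locally indexed database such as uniprox ends up with a gnl prefix while "-sdbname sp" does not.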
> By the way, is there a way of making seqret return the same style header
> as WU-BLAST sp2fasta, i.e. >db|accno|id .... (instead of
> >db|id|accno), or is this what the ncbi format is intended to do?
Hmmmm .... yet another FASTA format (and see below for another one).
Yes, that looks like a good idea. We need an output name for it, perhaps
wublast is the best choice.
You mentioned UniProt 14 - the latest release also includes extensions
to the Fasta format description to tag species and other information. We
are considering making this the default version of the FASTA format for
EMBOSS so we can preserve more information - does this sound like a good
idea?
For example: >sp|Q4U9M9|104K_THEAN 104 kDa microneme/rhoptry antigen
OS=Theileria annulata GN=TA08425 PE=3 SV=1
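
To show what preserving this information could look like to a script reading the file, here is a minimal Python sketch that splits such a header into its parts. It assumes every tag is two capital letters followed by "=" (as in OS, GN, PE and SV above) and that the description comes before the first tag. The function name parse_uniprot_header is hypothetical and this is not EMBOSS code.

```python
import re


def parse_uniprot_header(header):
    """Split a UniProt-style FASTA header into identifier parts and tagged fields."""
    # Join a wrapped header into one line and collapse runs of whitespace.
    text = " ".join(header.lstrip(">").split())
    ident, _, rest = text.partition(" ")
    db, accession, entry_name = ident.split("|")
    # Split at whitespace that is directly followed by a two-letter TAG=.
    parts = re.split(r"\s(?=[A-Z]{2}=)", rest)
    fields = {
        "db": db,
        "accession": accession,
        "entry_name": entry_name,
        "description": parts[0],
    }
    for part in parts[1:]:
        tag, _, value = part.partition("=")
        fields[tag] = value
    return fields


example = (">sp|Q4U9M9|104K_THEAN 104 kDa microneme/rhoptry antigen\n"
           "OS=Theileria annulata GN=TA08425 PE=3 SV=1")
print(parse_uniprot_header(example))
```

A header with no tags at all still parses, with everything after the identifier kept as the description.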
Also on the subject of UniProt 14 - the .dat flat files have a new
syntax for the DE lines. We had to ignore that as the change appeared
just before EMBOSS 6.0.0. Is anyone interested in having the details
parsed out, or in having the original friendly description generated?
ID 104K_THEAN Reviewed; 893 AA.
AC Q4U9M9;
DT 18-APR-2006, integrated into UniProtKB/Swiss-Prot.
DT 05-JUL-2005, sequence version 1.
DT 22-JUL-2008, entry version 18.
DE RecName: Full=104 kDa microneme/rhoptry antigen;
DE AltName: Full=p104;
DE Flags: Precursor;
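
As a rough idea of what generating a friendly description might involve, here is a Python sketch that folds the structured DE lines back into one line. It rests on assumptions: the output layout (recommended name, then flags in lower case, then alternative names in parentheses) is a guess at the older style, and only Full= names under RecName, SubName and AltName are used. Short=, EC= and the Includes/Contains sections are not handled properly. The function name friendly_description is hypothetical.

```python
import re

# Category prefixes that can start a structured DE line.
_CATEGORY = re.compile(r"^(RecName|AltName|SubName|Flags):\s*(.*)$")


def friendly_description(de_lines):
    """Rebuild a one-line description from structured UniProt DE lines."""
    rec_name, alt_names, flags = "", [], []
    category = None
    for line in de_lines:
        # Drop the two-letter "DE" line code and the trailing semicolon.
        body = line[2:].strip().rstrip(";")
        match = _CATEGORY.match(body)
        if match:
            category, body = match.group(1), match.group(2)
        if category == "Flags":
            flags.extend(f.strip().lower() for f in body.split(";") if f.strip())
        elif body.startswith("Full="):
            name = body[len("Full="):]
            if category == "AltName":
                alt_names.append(name)
            elif not rec_name:
                # Keep only the first recommended (or submitted) full name.
                rec_name = name
    # ASSUMPTION: this ordering imitates the older free-text DE style.
    text = " ".join([rec_name] + flags)
    for alt in alt_names:
        text += f" ({alt})"
    return text


de_lines = [
    "DE RecName: Full=104 kDa microneme/rhoptry antigen;",
    "DE AltName: Full=p104;",
    "DE Flags: Precursor;",
]
print(friendly_description(de_lines))
```

The slice after the line code is stripped of whitespace, so the same function accepts both the lines as shown here and the wider column layout of the .dat files.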
Hope this helps, even if it adds some new questions!
Peter