[EMBOSS] seqret output sequence format "ncbi"

Tue Aug 19 08:29:46 UTC 2008

john walshaw (JIC) wrote:
> I'm having trouble getting seqret to return the expected FASTA-header
> style when using the 'ncbi' output sequence format, when applying it to
> either the native UniProt data files or an EMBOSS database made from
> them.
>  
> In the manual for seqret, in the section "Output Format...", this is the
> description of the "ncbi" style of FASTA format:
>  
> ncbi multiple NCBI style FASTA format with the database name, entry
>    name and accession number separated by pipe ("|") characters.

This could be extended to explain that NCBI also have an annoyingly 
short list of valid database names. Any other name has to appear as 
"gnl|dbname", as you see for your uniprox database indexed with dbxflat. 
We use "unk" if we have no known database name, but we treat it as a 
general name - NCBI's "unk|identifier" is something special to them.

If you use one of the "NCBI list" database names, for example adding 
"-sdbname sp" to the command line, you will get a swissprot NCBI standard 
identifier - but this is because "sp" is one of their special names. You 
cannot even assume the data is protein if you see "sp" in the identifier 
(genpept for example uses emb and gb as database names for protein 
sequences).
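
The naming rule described above can be sketched in a few lines of Python. This is only an illustration, not the EMBOSS code: the function name ncbi_db_prefix is made up, the set of NCBI tags is a partial list for demonstration, and wrapping "unk" as "gnl|unk" is an assumption based on the remark that it is treated as a general name.

```python
# Partial, illustrative set of database tags that NCBI treats as special.
# The real NCBI list may differ; this is an assumption for the sketch.
NCBI_DB_TAGS = {"gb", "emb", "dbj", "pir", "prf", "sp", "pdb", "ref"}


def ncbi_db_prefix(dbname=None):
    """Return the leading part of an NCBI-style identifier for a database name.

    Hypothetical helper: names on the NCBI list are used directly, any other
    name is written as a general database, "gnl|dbname".
    """
    name = dbname or "unk"  # assumption: no known name falls back to "unk"
    if name in NCBI_DB_TAGS:
        return name
    return "gnl|" + name


print(ncbi_db_prefix("sp"))
print(ncbi_db_prefix("uniprox"))
print(ncbi_db_prefix())
```

Only the prefix is built here, because the order of the remaining fields depends on the database.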

> By the way, is there a way of making seqret return the same style header
> as WU-BLAST sp2fasta, i.e. >db|accno|id  ....  (instead of
> >db|id|accno), or is this what the ncbi format is intended to do?

Hmmmm .... yet another FASTA format (and see below for another one). 
Yes, that looks like a good idea. We need an output name for it, perhaps 
wublast is the best choice.
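
Until such an output format exists, the two layouts can be converted with a small script. The following is a minimal sketch that assumes a three-field identifier of the form db|id|accno; the function name swap_id_and_accession is hypothetical and nothing here is part of EMBOSS.

```python
def swap_id_and_accession(header):
    """Turn '>db|id|accno description' into '>db|accno|id description'.

    Assumes exactly three pipe-separated fields before the first space.
    """
    ident, sep, description = header.partition(" ")
    fields = ident.lstrip(">").split("|")
    if len(fields) != 3:
        raise ValueError("expected db|id|accno, got: " + ident)
    db, first, second = fields
    # Swap the second and third fields, keep the description untouched.
    return ">" + "|".join((db, second, first)) + sep + description


print(swap_id_and_accession(">sp|104K_THEAN|Q4U9M9 104 kDa microneme/rhoptry antigen"))
```

Applying the function twice returns the original header, so the same code converts in either direction.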

You mentioned UniProt 14 - the latest release also includes extensions 
to the FASTA format description to tag species and other information. We 
are considering making this the default version of the FASTA format for 
EMBOSS so we can preserve more information - does this sound like a good 
idea?

For example: >sp|Q4U9M9|104K_THEAN 104 kDa microneme/rhoptry antigen 
OS=Theileria annulata GN=TA08425 PE=3 SV=1
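
To show what information the extended header carries, here is a minimal Python sketch that splits such a line into its parts. It assumes only the four tags seen in the example (OS, GN, PE, SV) and a three-field identifier; the function name parse_uniprot_header is made up for illustration.

```python
import re

# Tags seen in the example header; other tags are not handled in this sketch.
TAG = re.compile(r"\s(OS|GN|PE|SV)=")


def parse_uniprot_header(line):
    """Split a UniProt-style FASTA header into identifier, description and tags."""
    ident, _, rest = line.lstrip(">").partition(" ")
    db, accession, entry_name = ident.split("|")
    # Splitting on a pattern with a capture group keeps the tag names:
    # [description, "OS", value, "GN", value, ...]
    parts = TAG.split(" " + rest)
    tags = {key: value.strip() for key, value in zip(parts[1::2], parts[2::2])}
    return {
        "db": db,
        "accession": accession,
        "entry_name": entry_name,
        "description": parts[0].strip(),
        "tags": tags,
    }


header = (">sp|Q4U9M9|104K_THEAN 104 kDa microneme/rhoptry antigen "
          "OS=Theileria annulata GN=TA08425 PE=3 SV=1")
print(parse_uniprot_header(header))
```

The header is written as a single line here; in the example above it is only wrapped by the mail client.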

Also on the subject of UniProt 14 - the .dat flat files have a new 
syntax for the DE lines. We had to ignore that as the change appeared 
just before EMBOSS 6.0.0. Is anyone interested in having the details 
parsed out, or in having the original friendly description generated?

ID   104K_THEAN              Reviewed;         893 AA.
AC   Q4U9M9;
DT   18-APR-2006, integrated into UniProtKB/Swiss-Prot.
DT   05-JUL-2005, sequence version 1.
DT   22-JUL-2008, entry version 18.
DE   RecName: Full=104 kDa microneme/rhoptry antigen;
DE   AltName: Full=p104;
DE   Flags: Precursor;
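
As a rough idea of what parsing these lines could involve, here is a small Python sketch. It handles only the RecName, AltName and Flags categories shown above and ignores Short= names and Contains/Includes sections. The function names are hypothetical, and the layout of the generated description (name, "precursor", alternative names in parentheses) is an assumption about the old style, not something EMBOSS does.

```python
def parse_de_lines(lines):
    """Collect RecName/AltName full names and flags from UniProt DE lines."""
    result = {"RecName": [], "AltName": [], "Flags": []}
    category = None
    for line in lines:
        if not line.startswith("DE"):
            continue
        text = line[2:].strip().rstrip(";")
        # A colon before any "=" starts a new category, e.g. "RecName:".
        if ":" in text.split("=")[0]:
            category, _, text = text.partition(":")
            category = category.strip()
            text = text.strip()
        if category == "Flags":
            result["Flags"].extend(flag.strip() for flag in text.split(";"))
        elif text.startswith("Full=") and category in result:
            result[category].append(text[len("Full="):])
    return result


def friendly_description(parsed):
    """Build a one-line description (assumed old-style layout)."""
    name = parsed["RecName"][0] if parsed["RecName"] else ""
    if "Precursor" in parsed["Flags"]:
        name += " precursor"
    return name + "".join(" (%s)" % alt for alt in parsed["AltName"])


de_lines = [
    "DE   RecName: Full=104 kDa microneme/rhoptry antigen;",
    "DE   AltName: Full=p104;",
    "DE   Flags: Precursor;",
]
print(friendly_description(parse_de_lines(de_lines)))
```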

Hope this helps, even if it adds some new questions!

Peter