[EMBOSS] seqret output sequence format "ncbi"

Tue Aug 19 09:25:36 UTC 2008

Thanks for your help, Peter. Please see my comments below.

> -----Original Message-----
> From: Peter Rice [mailto:pmr at ebi.ac.uk] 
> Sent: 19 August 2008 09:30
> To: john walshaw (JIC)
> Cc: emboss at emboss.open-bio.org
> Subject: Re: [EMBOSS] seqret output sequence format "ncbi"
> 
> john walshaw (JIC) wrote:
> > I'm having trouble getting seqret to return the expected FASTA-header
> > style when using the 'ncbi' output sequence format, when applying it
> > to either the native UniProt data files or an EMBOSS database made
> > from them.
> >  
> > In the manual for seqret, in the section "Output Format...", this is
> > the description of the "ncbi" style of FASTA format:
> >  
> > ncbi multiple NCBI style FASTA format with the database name, entry
> >    name and accession number separated by pipe ("|") characters.
> 
> This could be extended to explain that NCBI also have an 
> annoyingly short list of valid database names. Any other name 
> has to appear as "gnl|dbname", as you see for your uniprox 
> database indexed with dbxflat. 
> We use "unk" if we have no known database name, but we treat 
> it as a general name - NCBI's "unk|identifier" is something 
> special to them.
> 
> If you use one of the "NCBI list" database names, for example 
> adding "-sdbname sp" to the command line, you will get a 
> swissprot NCBI standard identifier - but this is because "sp" 
> is one of their special names. You cannot even assume the 
> data is protein if you see "sp" in the identifier (genpept 
> for example uses emb and gb as database names for protein sequences).
> 

Thanks, I see, that does it. So if I had named my database 'sp' instead
of 'uniprot' this would have worked automatically. 

> > By the way, is there a way of making seqret return the same style
> > header as WU-BLAST sp2fasta, i.e. ">db|accno|id ...." (instead of
> > ">db|id|accno"), or is this what the ncbi format is intended to do?
> 
> Hmmmm .... yet another FASTA format (and see below for another one). 
> Yes, that looks like a good idea. We need an output name for 
> it, perhaps wublast is the best choice.
> 

BTW the reason I ask is that sp2fasta doesn't seem able to handle the
format changes in the DE lines which appeared in release 14.0. My
theory is that the format change (see below) occasionally makes some
DE lines very long, too long for sp2fasta to read in full, so the
final continuation character (semicolon) is missed. If the next line
is also a DE line, sp2fasta then halts with a fatal error about there
being multiple definition lines for one sequence record. Examples in
release 14.0 are Q5IGR8 (in uniprot_sprot.dat) and A0B5I3 (in
uniprot_trembl.dat). If there's a very long DE line but the next line
isn't a DE line, it doesn't matter and sp2fasta is OK.

> You mentioned UniProt 14 - the latest release also includes 
> extensions to the Fasta format description to tag species and 
> other information. We are considering making this the default 
> version of the FASTA format for EMBOSS so we can preserve 
> more information - does this sound like a good idea?
> 
> For example: >sp|Q4U9M9|104K_THEAN 104 kDa microneme/rhoptry 
> antigen OS=Theileria annulata GN=TA08425 PE=3 SV=1
> 

Personally, I think this would be a good idea. I'm assuming that
EMBOSS progs would themselves be able to parse these fields from the
FASTA headers?
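
To illustrate what I mean by parsing, here is a rough Python sketch
that pulls the tagged fields out of the example header quoted above.
It is not how EMBOSS does or would do it. It assumes the name part is
db|accession|entry_name and that every tag is two capital letters
followed by "=" (OS, GN, PE and SV are the only ones in the example);
a description that happened to contain such a pattern would be split
in the wrong place. The function name parse_uniprot_header is
invented for the example.

```python
import re

# Two capital letters followed by "=", e.g. OS=, GN=, PE=, SV= (assumed tag shape).
_TAG = re.compile(r"\b([A-Z]{2})=")


def parse_uniprot_header(header: str) -> dict:
    """Split a '>db|acc|name description OS=... GN=...' header into fields."""
    line = header.lstrip(">").strip()
    name, _, rest = line.partition(" ")
    # Raises ValueError if the name does not have exactly three fields.
    db, accession, entry_name = name.split("|")
    record = {"db": db, "accession": accession, "entry_name": entry_name}

    matches = list(_TAG.finditer(rest))
    # Everything before the first tag is the free-text description.
    desc_end = matches[0].start() if matches else len(rest)
    record["description"] = rest[:desc_end].strip()

    # Each tag's value runs up to the start of the next tag (or end of line).
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(rest)
        record[m.group(1)] = rest[m.end():end].strip()
    return record


if __name__ == "__main__":
    example = (">sp|Q4U9M9|104K_THEAN 104 kDa microneme/rhoptry "
               "antigen OS=Theileria annulata GN=TA08425 PE=3 SV=1")
    for key, value in parse_uniprot_header(example).items():
        print(key, "=", value)
```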

> 
> Also on the subject of UniProt 14 - the .dat flat files have a new 
> syntax for the DE lines. We had to ignore that as the change appeared 
> just before EMBOSS 6.0.0. Is anyone interested in having the details 
> parsed out, or in having the original friendly description generated?
> 
> ID   104K_THEAN              Reviewed;         893 AA.
> AC   Q4U9M9;
> DT   18-APR-2006, integrated into UniProtKB/Swiss-Prot.
> DT   05-JUL-2005, sequence version 1.
> DT   22-JUL-2008, entry version 18.
> DE   RecName: Full=104 kDa microneme/rhoptry antigen;
> DE   AltName: Full=p104;
> DE   Flags: Precursor;

Having the option to parse them out would be useful :) These multiple
names can be a bit awkward sometimes, so if UniProt and EMBOSS do some
of the work for you, that's got to be good.
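
As a sketch of the kind of thing I have in mind, the Python below
collects the RecName, AltName and Flags values from the three DE
lines in the example entry and joins them into a single description.
The function names are invented, only "Full=" values on the
"RecName:" / "AltName:" lines are handled, and the layout of the
joined description ("name precursor (altname)") is my own guess at a
friendly form rather than anything EMBOSS or UniProt defines.

```python
def parse_de_lines(lines):
    """Collect RecName/AltName/Flags values from new-style DE lines."""
    names = {"RecName": [], "AltName": [], "Flags": []}
    for line in lines:
        if not line.startswith("DE "):
            continue
        # Drop the 'DE' line code, surrounding blanks and the final ';'.
        content = line[2:].strip().rstrip(";")
        category, _, value = content.partition(":")
        category, value = category.strip(), value.strip()
        if category not in names:
            # Other lines (e.g. ones without a recognised category) are skipped.
            continue
        if category == "Flags":
            # A Flags line may carry several ';'-separated flags.
            names["Flags"].extend(f.strip() for f in value.split(";") if f.strip())
        elif value.startswith("Full="):
            names[category].append(value[len("Full="):])
    return names


def friendly_description(names):
    """Join the parsed names into one line (layout is an assumption)."""
    text = "; ".join(names["RecName"])
    if "Precursor" in names["Flags"]:
        text += " precursor"
    for alt in names["AltName"]:
        text += f" ({alt})"
    return text


if __name__ == "__main__":
    entry = [
        "DE   RecName: Full=104 kDa microneme/rhoptry antigen;",
        "DE   AltName: Full=p104;",
        "DE   Flags: Precursor;",
    ]
    print(friendly_description(parse_de_lines(entry)))
```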

> 
> Hope this helps, even if it adds some new questions!
> 

Certainly does, many thanks again.

cheers

John

> Peter
>