[EMBOSS] notseq and fasta definition headers

Peter Rice pmr at ebi.ac.uk
Tue Jun 17 20:28:47 UTC 2008


Andres Pinzon wrote:
> The output is correct, but notseq changes the definition in the fasta
> headers, so if the fasta header in "xaa.list.fasta" was:
> 
> lcl|29855|ORF26673_6
> 
> the corresponding fasta header in sequence in 1000-1.fasta is:
> 
> 29855
> 
> Is there a way to tell "notseq" to keep the original fasta headers intact?

Yes.

FASTA format is not simple: we have seen many ways to hide extra 
information in the ID (EMBOSS recognizes NCBI ID formats and parses out 
the ID 29855) and also in the description (we try to recognize 
conventions used by GCG and ACEDB).

But you can also specify "pearson" format, which reads the ID without 
parsing. Just add to the command line:

notseq -sf pearson
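
To illustrate the difference between the two ways of reading the header, 
here is a minimal Python sketch. It is not the EMBOSS code: the function 
names pearson_id and ncbi_style_id are hypothetical, and the "take the 
second '|'-separated field" rule is only an assumption that reproduces the 
29855 result for this one header.

```python
def pearson_id(header: str) -> str:
    """Return the first whitespace-delimited token of a FASTA header, unparsed."""
    text = header.lstrip(">").strip()
    return text.split()[0] if text else ""


def ncbi_style_id(header: str) -> str:
    """Toy NCBI-style parse (assumption): for 'db|id|name' return the 'id' field."""
    token = pearson_id(header)
    fields = token.split("|")
    if len(fields) >= 2 and fields[1]:
        return fields[1]
    # No '|' structure found, so fall back to the unparsed token.
    return token


header = ">lcl|29855|ORF26673_6"
print(pearson_id(header))     # lcl|29855|ORF26673_6
print(ncbi_style_id(header))  # 29855
```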

Now you have another problem. On its own this will not work for notseq, 
because the IDs you want to exclude now contain '|' characters!

The exclude string in notseq is a pattern. In processing the pattern, 
some pattern characters are removed:

	whitespace
	',' and ';'
	'|'

So your exclude pattern cannot include any '|' characters.

As a workaround, you can exclude "*ORF26673_6" and the IDs will be 
preserved.
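
The effect of the removed characters, and why the wildcard workaround 
gets around it, can be sketched in Python as follows. This is an 
illustration under assumptions, not the notseq implementation: 
strip_pattern and excluded are hypothetical names, and Python's 
fnmatchcase stands in for the wildcard matching that EMBOSS does.

```python
from fnmatch import fnmatchcase

# Characters described above as removed from the exclude pattern.
STRIPPED = set(" \t\n,;|")


def strip_pattern(pattern: str) -> str:
    """Drop whitespace, ',', ';' and '|' from a pattern (assumed behaviour)."""
    return "".join(ch for ch in pattern if ch not in STRIPPED)


def excluded(seq_id: str, pattern: str) -> bool:
    """True if seq_id matches the pattern once the characters are stripped."""
    return fnmatchcase(seq_id, strip_pattern(pattern))


seq_id = "lcl|29855|ORF26673_6"  # ID as read with pearson format

# The full ID as a pattern loses its pipes and no longer matches.
print(strip_pattern("lcl|29855|ORF26673_6"))     # lcl29855ORF26673_6
print(excluded(seq_id, "lcl|29855|ORF26673_6"))  # False

# The wildcard pattern contains no stripped characters and still matches.
print(excluded(seq_id, "*ORF26673_6"))           # True
```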

For the next release we will allow '|' characters. When notseq was first 
written there was a possibility to use regular expressions, but now we 
only use simple text matching so the pipe characters are not a problem.

Hope that helps

Peter



