[EMBOSS] iep/gifasta

Tue Dec 18 09:23:18 UTC 2007

Hi Bernd,

Bernd Web wrote:
> Hi,
> 
> I'd like to run iep on a sequence and use either pir or osformat gifasta.
> The following gives an error (using emboss 5.0.0 on Debian):
> 
> iep -filter -osformat gifasta -sequence seq.txt
> This returns "Died: Unknown qualifier -osformat"

-osformat applies only to sequence outputs, and iep has no sequence
outputs, so the qualifier is not recognised.

iep writes a plain text file as output and has no special output
options, but we will add more information (accession and description)
in a future release, for iep and for other plain text output files too.

> iep -filter -sformat pir seq.txt or iep -sformat pir -sequence seq.txt
> also give an error:
> "Died: iep terminated: Bad value for '-sequence' with -auto defined"
> (with or without the sequence flag)
> 
> However, iep -sformat fasta seq.txt works. What am I doing wrong?

It appears your sequence can be read in fasta format but not in pir 
format. PIR format expects a type code and a semicolon (for example 
"P1;") immediately after the first '>', which a plain FASTA header 
does not have.
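
To illustrate the difference, here is a minimal Python sketch of how a
PIR (NBRF) entry is laid out: a ">P1;" prefix before the ID, the
description on a line of its own, and a "*" terminating the sequence.
The function name to_pir and the example sequence are made up for this
illustration, and this is not something iep needs; reading the file as
fasta is enough.

```python
def to_pir(seq_id, description, sequence):
    """Format one protein entry as a PIR (NBRF) record.

    Hypothetical helper for illustration only.
    """
    # "P1" is the PIR type code for a complete protein sequence.
    header = ">P1;" + seq_id
    # The description goes on its own line, and the sequence ends with "*".
    return header + "\n" + description + "\n" + sequence + "*\n"


# Made-up example sequence, not taken from any real entry.
print(to_pir("ENSG00000205090_1", "protein_coding", "MKVLAAGIVG"), end="")
```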

> My FastA definition line is e.g.
>> ENSG00000205090|1|protein_coding.
> The IEP report would be more useful if it contained the ENSG number
> instead of "protein coding" or the entire definition line.

Not a nice format. NCBI made up a lot of FASTA file identifiers with '|' 
characters and we try to follow their rules. That causes us to ignore 
the first part (it should be a database name) and read the ID from the end.

You could reformat the FASTA files (e.g. with a perl script) to remove 
the '|' characters and leave something useful as the plain ID (perhaps 
ENSG00000205090_1 in this case) and the rest as description.
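
As a rough sketch of that reformatting (in Python rather than perl, and
assuming every header has exactly the three '|'-separated fields shown
in your example), something like the following would do. The function
names rewrite_header and rewrite_fasta are only illustrative, and
headers with fewer than three fields are passed through unchanged.

```python
def rewrite_header(line):
    """Turn '>ENSG00000205090|1|protein_coding' into
    '>ENSG00000205090_1 protein_coding'."""
    fields = line[1:].strip().split("|")
    if len(fields) < 3:
        # Assumption: anything not in the three-field layout is left alone.
        return line.rstrip("\n")
    new_id = "_".join(fields[:2])        # e.g. ENSG00000205090_1
    description = " ".join(fields[2:])   # everything else becomes description
    return ">" + new_id + " " + description


def rewrite_fasta(lines):
    """Yield the lines of a FASTA file with the header lines rewritten."""
    for line in lines:
        if line.startswith(">"):
            yield rewrite_header(line)
        else:
            yield line.rstrip("\n")


# Small in-memory example; the sequence is made up.
example = [">ENSG00000205090|1|protein_coding\n", "MKVLAAGIVG\n"]
for out in rewrite_fasta(example):
    print(out)
```

To convert a real file, pass an open file handle to rewrite_fasta in
place of the example list, write the result to a new file, and run iep
on that file as before.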

Hope that helps,

Peter Rice