[EMBOSS] FASTA format appears to get misrecognised as GCG

Peter Rice ricepeterm at yahoo.co.uk
Thu Mar 12 08:32:20 UTC 2015


Hi Jan,

On 11/03/2015 19:15, Jan Kim wrote:
> Dear All,
>
> I've just had "water" in EMBOSS 6.5.7.0 fail, and traced this back
> to the regular expression "CHECK: [0-9].*\.\." matching the header
> line of a FASTA file. The command
>
>      water -asequence b.fasta  [...]  -auto
>
> terminates with
>
>      Warning: Sequence 'gcg::b.fasta:broken' has zero length, ignored
>      Error: Unable to read sequence 'b.fasta'
>
> As a minimal demo, any sequence with the header
>
>      >broken CHECK: 0 ..
>
> causes the problem, and expressly stating the format (via "fasta::b.fasta"
> rather than just "b.fasta") fixes it.
>
> My speculation at this point is that somehow matching the regexp mentioned
> above causes the autodetection to identify the format as GCG rather than
> as FASTA.

That is what I would expect. We test for GCG format first, which 
requires scanning through a possibly long header looking for a checksum 
line, and then testing whether we can read as GCG.
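
As a rough illustration of why the header trips the detector, here is a
minimal Python sketch. It is not EMBOSS source code: the pattern is the
one quoted in Jan's message, and guess_format() is a hypothetical helper
that simply tests for GCG before FASTA, as described above.

```python
import re

# Pattern as quoted in Jan's message; the expression EMBOSS uses
# internally may differ in detail.
GCG_CHECK = re.compile(r"CHECK: [0-9].*\.\.")


def guess_format(first_line):
    """Hypothetical detector that tests for GCG before FASTA."""
    if GCG_CHECK.search(first_line):
        return "gcg"
    if first_line.startswith(">"):
        return "fasta"
    return "unknown"


print(guess_format(">broken CHECK: 0 .."))   # gcg
print(guess_format(">ok some description"))  # fasta
```

Because the GCG test runs first, a FASTA header that happens to contain
checksum-like text is claimed as GCG before the ">" is ever considered.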

It is supposed to continue trying other formats if GCG format fails. We 
can try variations on your 'broken' FASTA format and make sure EMBOSS 
can read it as FASTA in future.

The problem arises from a legacy interpretation of GCG format. GCG had a
program called 'reformat' that would correct the check: line after
editing, so EMBOSS tried to replicate this by reading the file even if
no length was found. I do not believe anyone is depending on this
feature now, so we can safely also check for a length: value and use
that.
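
A stricter test along these lines could be sketched as follows. This is
an assumption-laden Python illustration, not the change made in EMBOSS:
the function name gcg_length(), both patterns, and the sample GCG header
line are made up for the example.

```python
import re

# Illustrative patterns only; EMBOSS's own expressions are not shown
# in this thread.
LENGTH = re.compile(r"Length:\s*(\d+)", re.IGNORECASE)
CHECK = re.compile(r"Check:\s*\d+.*\.\.", re.IGNORECASE)


def gcg_length(line):
    """Return the declared length if the line carries both a Length:
    value and a Check: value ending in '..'; otherwise return None."""
    length = LENGTH.search(line)
    if length is None or CHECK.search(line) is None:
        return None
    return int(length.group(1))


# Hypothetical GCG-style header line.
gcg_line = "  b.seq  Length: 120  March 12, 2015 08:32  Type: N  Check: 4711  .."
print(gcg_length(gcg_line))               # 120
print(gcg_length(">broken CHECK: 0 .."))  # None
```

With a length required, the 'broken' FASTA header no longer qualifies as
GCG and detection can move on to the other formats.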

> This doesn't exactly match my expectations based on the USA specs [1],
> according to which EMBOSS expects FASTA by default and will try other
> formats only if that doesn't work. (I have some inkling that this
> default can be configured somewhere, but I haven't found anything
> suspicious in /usr/local/share/EMBOSS and a quick scan didn't turn up
> any stray .embossrc files either.)

Ah, perhaps we could rephrase that. If no format is specified, EMBOSS 
tries all possible formats.

Even FASTA format is complicated - especially how EMBOSS reads the ID.
There are various versions of FASTA format: EMBOSS can read an
NCBI/Blast style ID ('ncbi' format), or use whatever is there without
trying to parse or clean it up ('pearson' format), which you have to
specify explicitly.

The default format can be configured by setting the environment variable
EMBOSS_FORMAT, but using fasta:: in the USA, or following it with
-sformat fasta (or -sf fasta), is the usual way.
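
For a script that builds USAs programmatically, the prefix can be added
with a small helper. The Python sketch below is only an illustration:
with_format() is a hypothetical name, and it assumes that any USA
already containing "::" names its format.

```python
def with_format(usa, fmt="fasta"):
    """Prefix a USA with 'fmt::' unless it already names a format."""
    if "::" in usa:
        return usa
    return f"{fmt}::{usa}"


print(with_format("b.fasta"))     # fasta::b.fasta
print(with_format("gcg::b.seq"))  # gcg::b.seq
```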

> As a bit of background, this happened in an "embedded script", and the
> regexp was right in the sense that stuff from a GCG (or similar) formatted
> file had found its way into the FASTA header. I hope I fixed my script
> now by expressly stating the format; this posting is to solicit comments
> regarding whether I've done something wrong / stupid (and possibly to
> leave some hints regarding this matter in the mailing list archives...).
>
> [1] http://emboss.sourceforge.net/docs/themes/UniformSequenceAddress.html

Many thanks for pointing this out. It will be cleaned up in a future 
version (I just tested the change) and we will revise the description on 
the website.

regards,

Peter Rice
EMBOSS Team

