[EMBOSS] FASTA format appears to get misrecognised as GCG
ricepeterm at yahoo.co.uk
Thu Mar 12 08:32:20 UTC 2015
On 11/03/2015 19:15, Jan Kim wrote:
> Dear All,
> I've just had "water" in EMBOSS 22.214.171.124 fail, and traced this back
> to the regular expression "CHECK: [0-9].*\.\." matching the header
> line of a FASTA file. The command
> water -asequence b.fasta [...] -auto
> terminates with
> Warning: Sequence 'gcg::b.fasta:broken' has zero length, ignored
> Error: Unable to read sequence 'b.fasta'
> As a minimal demo, any sequence with the header
> >broken CHECK: 0 ..
> causes the problem, and expressly stating the format (via "fasta::b.fasta"
> rather than just "b.fasta") fixes it.
> My speculation at this point is that somehow matching the regexp mentioned
> above causes the autodetection to identify the format as GCG rather than
> as FASTA.
That is what I would expect. We test for GCG format first, which
requires scanning through a possibly long header looking for a checksum
line, and then testing whether we can read as GCG.
It is supposed to continue trying other formats if GCG format fails. We
can try variations on your 'broken' FASTA format and make sure EMBOSS
can read it as FASTA in future.
The problem arises from legacy interpretation of GCG format. GCG had a
program called 'reformat' that would correct the check: line after
editing, so EMBOSS tried to replicate this by reading even if no length
was found. I do not believe anyone is depending on this feature now, so
we can safely also check for a length: value and use that.
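To illustrate the point, here is a minimal Python sketch (not the actual EMBOSS C code) of why the reported header trips the checksum test, and how also requiring a length value would reject it. The pattern is the one quoted in the report; the function name looks_like_gcg, the LENGTH_RE pattern, and the sample GCG-style line are assumptions made for illustration only.

```python
import re

# Pattern quoted in the report above; the real check lives in EMBOSS's C
# sources and may differ in detail.
CHECK_RE = re.compile(r"CHECK: [0-9].*\.\.")

# Hypothetical extra test for a length value (an assumption, not EMBOSS code).
LENGTH_RE = re.compile(r"LENGTH: [0-9]+")


def looks_like_gcg(line, require_length=False):
    """Return True if a header line would be taken for a GCG checksum line."""
    if not CHECK_RE.search(line):
        return False
    if require_length and not LENGTH_RE.search(line):
        return False
    return True


# Header from the minimal demo in the report.
fasta_header = ">broken CHECK: 0 .."
# Illustrative GCG-style line carrying both a length and a checksum.
gcg_line = "b.seq  LENGTH: 4  CHECK: 1234  .."

print(looks_like_gcg(fasta_header))                       # checksum test alone
print(looks_like_gcg(fasta_header, require_length=True))  # with length test
print(looks_like_gcg(gcg_line, require_length=True))
```

With the checksum test alone the FASTA header is accepted as GCG; once a length value is also required, only the GCG-style line passes.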
> This doesn't exactly match my expectations based on the USA specs,
> according to which EMBOSS expects FASTA by default and will try other
> formats only if that doesn't work. (I have some inkling that this
> default can be configured somewhere, but I haven't found anything
> suspicious in /usr/local/share/EMBOSS and a quick scan didn't turn up
> any stray .embossrc files either.)
Ah, perhaps we could rephrase that. If no format is specified, EMBOSS
tries all possible formats.
Even FASTA format is complicated - especially how EMBOSS reads the ID.
There are various versions of FASTA format where EMBOSS can read an
NCBI/Blast style ID ('ncbi' format) or use whatever is there without
trying to parse or clean it up ('pearson' format) which you have to
The default format can be configured by setting the environment variable
EMBOSS_FORMAT, but using fasta:: in the USA, or following it with
-sformat fasta (or -sf fasta) is the usual way.
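For anyone scripting this, the three options can be sketched in Python as below. This only builds the strings and environment to pass to an EMBOSS program and does not run anything. The helper names are hypothetical, and the rest of the "water" command line is left out because it was elided in the original report.

```python
import os


def explicit_usa(path, fmt="fasta"):
    """Prefix a file name with a format, giving a USA such as fasta::b.fasta."""
    return f"{fmt}::{path}"


def format_qualifier(fmt="fasta"):
    """Qualifier to place after the sequence argument (-sf is the short form)."""
    return ["-sformat", fmt]


def env_with_default_format(fmt="fasta"):
    """Copy of the current environment with EMBOSS_FORMAT set."""
    env = dict(os.environ)
    env["EMBOSS_FORMAT"] = fmt
    return env


# Option 1: state the format in the USA itself.
args_usa = ["-asequence", explicit_usa("b.fasta")]

# Option 2: follow the sequence argument with the format qualifier.
args_qualifier = ["-asequence", "b.fasta"] + format_qualifier()

# Option 3: change the default through the environment.
env = env_with_default_format()

print(args_usa)
print(args_qualifier)
print(env["EMBOSS_FORMAT"])
```

The first two forms keep the choice visible on the command line, which is why they are the usual way; the environment variable changes the default for every program started with that environment.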
> As a bit of background, this happened in an "embedded script", and the
> regexp was right in the sense that stuff from a GCG (or similar) formatted
> file had found its way into the FASTA header. I hope I fixed my script
> now by expressly stating the format; this posting is to solicit comments
> regarding whether I've done something wrong / stupid (and possibly to
> leave some hints regarding this matter in the mailing list archives...).
>  http://emboss.sourceforge.net/docs/themes/UniformSequenceAddress.html
Many thanks for pointing this out. It will be cleaned up in a future
version (I just tested the change) and we will revise the description on