[EMBOSS] FASTA format appears to get misrecognised as GCG
Jan Kim
jttkim at googlemail.com
Wed Mar 11 19:15:41 UTC 2015
Dear All,
I've just had "water" in EMBOSS 6.5.7.0 fail, and traced this back
to the regular expression "CHECK: [0-9].*\.\." matching the header
line of a FASTA file. The command
    water -asequence b.fasta [...] -auto
terminates with
    Warning: Sequence 'gcg::b.fasta:broken' has zero length, ignored
    Error: Unable to read sequence 'b.fasta'
As a minimal demo, any sequence with the header
    >broken CHECK: 0 ..
causes the problem, and expressly stating the format (via "fasta::b.fasta"
rather than just "b.fasta") fixes it.
My speculation at this point is that somehow matching the regexp mentioned
above causes the autodetection to identify the format as GCG rather than
as FASTA.
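The match itself can be reproduced outside EMBOSS. The following is only a
minimal sketch in Python: it assumes that the regexp quoted above behaves the
same under Python's "re" module as it does in EMBOSS, and it says nothing about
how EMBOSS actually performs format autodetection. The helper names are
hypothetical and not part of EMBOSS.

```python
import re

# Pattern as quoted in this post. Assumption: Python's `re` module treats it
# the same way as the regex engine used by EMBOSS.
GCG_CHECK_PATTERN = re.compile(r"CHECK: [0-9].*\.\.")


def looks_like_gcg_header(line):
    """Return True if the line matches the quoted pattern (hypothetical helper)."""
    return GCG_CHECK_PATTERN.search(line) is not None


def suspicious_fasta_headers(lines):
    """Return the FASTA header lines ('>' prefix) that match the pattern."""
    return [
        line.rstrip("\n")
        for line in lines
        if line.startswith(">") and looks_like_gcg_header(line)
    ]


if __name__ == "__main__":
    demo = [">broken CHECK: 0 ..\n", "ACGT\n", ">fine some description\n", "ACGT\n"]
    for header in suspicious_fasta_headers(demo):
        print(header)
```

Running a check like this over the headers of a FASTA file before handing it to
an EMBOSS program would flag entries that are likely to trigger the problem.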
This doesn't exactly match my expectations based on the USA specs [1],
according to which EMBOSS expects FASTA by default and will try other
formats only if that doesn't work. (I have some inkling that this
default can be configured somewhere, but I haven't found anything
suspicious in /usr/local/share/EMBOSS and a quick scan didn't turn up
any stray .embossrc files either.)
As a bit of background, this happened in an "embedded script", and the
regexp was right in the sense that stuff from a GCG (or similar) formatted
file had found its way into the FASTA header. I hope I have now fixed my
script by expressly stating the format; this posting is to solicit comments
regarding whether I've done something wrong / stupid (and possibly to
leave some hints regarding this matter in the mailing list archives...).
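For the archives, the workaround of always stating the format can be sketched
as a small helper for a wrapper script. This is a hedged sketch in Python, not
something EMBOSS provides: the function name is hypothetical, and it assumes
that a "::" in the argument means a format has already been given.

```python
def with_explicit_format(usa, fmt="fasta"):
    """Prefix a sequence address with 'fmt::' unless a format is already stated.

    Hypothetical helper; assumes that '::' only occurs in the address when a
    format prefix is present.
    """
    if "::" in usa:
        return usa
    return f"{fmt}::{usa}"


if __name__ == "__main__":
    # The result would be passed as the -asequence value instead of the bare
    # file name.
    print(with_explicit_format("b.fasta"))
```

The value returned for "b.fasta" is the "fasta::b.fasta" form mentioned above,
which bypasses autodetection for that input.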
Best regards, Jan
[1] http://emboss.sourceforge.net/docs/themes/UniformSequenceAddress.html
--
+- Jan T. Kim -------------------------------------------------------+
| email: jttkim at gmail.com |
| WWW: http://www.jtkim.dreamhosters.com/ |
*-----=< hierarchical systems are for files, not for humans >=-----*