[EMBOSS] FASTA format appears to get misrecognised as GCG
Jan Kim
jttkim at googlemail.com
Thu Mar 12 17:25:15 UTC 2015
Dear Peter,
thanks for your comments -- good to know that my speculations were
largely correct and that prefixing files with fasta:: is a reliable
fix.
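
For the archives, here is a minimal Python sketch of the two points above: the checksum pattern quoted in my original report does match the "broken" FASTA header, and the workaround amounts to prefixing the file name with the format. The helper names (looks_like_gcg_checksum, with_explicit_format) are made up for illustration and are not part of EMBOSS, and how EMBOSS applies the pattern internally is an assumption on my part.

```python
import re

# Pattern as quoted in the original report. Applying it per line and
# case-sensitively is an assumption; EMBOSS internals may differ.
GCG_CHECK = re.compile(r"CHECK: [0-9].*\.\.")


def looks_like_gcg_checksum(line):
    """Return True if the line contains the quoted checksum pattern (hypothetical helper)."""
    return GCG_CHECK.search(line) is not None


def with_explicit_format(path, fmt="fasta"):
    """Prefix a file name with 'format::' unless it already carries such a prefix (hypothetical helper)."""
    return path if "::" in path else f"{fmt}::{path}"


if __name__ == "__main__":
    print(looks_like_gcg_checksum(">broken CHECK: 0 .."))   # True
    print(looks_like_gcg_checksum(">ok some description"))  # False
    print(with_explicit_format("b.fasta"))                  # fasta::b.fasta
```

A script that builds its own FASTA headers could use such a check to spot headers at risk, and pass the prefixed name to -asequence regardless.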
Best regards, Jan
On Thu, Mar 12, 2015 at 08:32:20AM +0000, Peter Rice wrote:
> Hi Jan,
>
> On 11/03/2015 19:15, Jan Kim wrote:
> >Dear All,
> >
> >I've just had "water" in EMBOSS 6.5.7.0 fail, and traced this back
> >to the regular expression "CHECK: [0-9].*\.\." matching the header
> >line of a FASTA file. The command
> >
> > water -asequence b.fasta [...] -auto
> >
> >terminates with
> >
> > Warning: Sequence 'gcg::b.fasta:broken' has zero length, ignored
> > Error: Unable to read sequence 'b.fasta'
> >
> >As a minimal demo, any sequence with the header
> >
> > >broken CHECK: 0 ..
> >
> >causes the problem, and expressly stating the format (via "fasta::b.fasta"
> >rather than just "b.fasta") fixes it.
> >
> >My speculation at this point is that somehow matching the regexp mentioned
> >above causes the autodetection to identify the format as GCG rather than
> >as FASTA.
>
> That is what I would expect. We test for GCG format first, which
> requires scanning through a possibly long header looking for a
> checksum line, and then testing whether we can read as GCG.
>
> It is supposed to continue trying other formats if GCG format fails.
> We can try variations on your 'broken' FASTA format and make sure
> EMBOSS can read it as FASTA in future.
>
> The problem arises from legacy interpretation of GCG format. GCG
> had a program called 'reformat' that would correct the check: line
> after editing, so EMBOSS tried to replicate this by reading even if
> no length was found. I do not believe anyone is depending on this
> feature now, so we can safely also check for a length: value and
> use that.
>
> >This doesn't exactly match my expectations based on the USA specs [1],
> >according to which EMBOSS expects FASTA by default and will try other
> >formats only if that doesn't work. (I have some inkling that this
> >default can be configured somewhere, but I haven't found anything
> >suspicious in /usr/local/share/EMBOSS and a quick scan didn't turn up
> >any stray .embossrc files either.)
>
> Ah, perhaps we could rephrase that. If no format is specified,
> EMBOSS tries all possible formats.
>
> Even FASTA format is complicated - especially how EMBOSS reads the
> ID. There are various versions of FASTA format where EMBOSS can read
> an NCBI/Blast style ID ('ncbi' format) or use whatever is there
> without trying to parse or clean it up ('pearson' format) which you
> have to specify explicitly.
>
> The default format can be configured by setting the environment
> variable EMBOSS_FORMAT, but using fasta:: in the USA, or following it
> with -sformat fasta (or -sf fasta), is the usual way.
>
> >As a bit of background, this happened in an "embedded script", and the
> >regexp was right in the sense that stuff from a GCG (or similar) formatted
> >file had found its way into the FASTA header. I hope I fixed my script
> >now by expressly stating the format; this posting is to solicit comments
> >regarding whether I've done something wrong / stupid (and possibly to
> >leave some hints regarding this matter in the mailing list archives...).
> >[1] http://emboss.sourceforge.net/docs/themes/UniformSequenceAddress.html
>
> Many thanks for pointing this out. It will be cleaned up in a future
> version (I just tested the change) and we will revise the
> description on the website.
>
> regards,
>
> Peter Rice
> EMBOSS Team
> _______________________________________________
> EMBOSS mailing list
> EMBOSS at mailman.open-bio.org
> http://mailman.open-bio.org/mailman/listinfo/emboss
--
+- Jan T. Kim -------------------------------------------------------+
| email: jttkim at gmail.com |
| WWW: http://www.jtkim.dreamhosters.com/ |
*-----=< hierarchical systems are for files, not for humans >=-----*