[emboss-dev] EMBOSS and its FASTA like alignment output

Tue Jul 21 13:05:35 UTC 2009

Hi all,

I've CC'd the Biopython-dev mailing list as this EMBOSS
thread is becoming cross project.

On Tue, Jul 21, 2009 at 1:06 PM, Peter Rice<pmr at ebi.ac.uk> wrote:
>
> Peter wrote:
>> Hi,
>>
>> One of the many things I talked to Peter Rice about in Sweden
>> was the Pearson FASTA like output from needle and water (e.g.
>> what EMBOSS calls the markx10 output format), and why it
>> includes the EMBOSS header and footer lines (which start with
>> a # character), which are not present in real FASTA output.
>>
>> Biopython can parse the pairwise -m 10 output from Bill
>> Pearson's FASTA tools, so in theory we (Biopython) should
>> be able to parse the markx10 output from EMBOSS needle
>> and water. We could probably cope with the extra header
>> and footer, but I think it would be best if EMBOSS could
>> produce something more closely matching the real FASTA
>> output. Unfortunately, it appears to be more than just the
>> headers which upset our parser - even ignoring them,
>> EMBOSS markx10 output still looks rather different to
>> (current) FASTA -m 10 output. Was the markx10 output
>> mimicking a particular (old) version of the FASTA tools?
>
> The source code documentation refers to FASTA 3.4 which
> may be the last time I took a detailed look at the FASTA
> alignment outputs.

That might explain it - I've been using FASTA 3.5.

> Can you send us some example files so we can check for
> the significant differences?

Sure. There are half a dozen FASTA -m 10 output files here:
http://biopython.open-bio.org/SRC/biopython/Tests/Fasta/

> We plan to install all the bio* projects so it would be helpful
> to have a set of biopython parser scripts we can use to test
> locally. We can add them to our routine QA tests and flag up
> changes as soon as they appear.

If you have (the latest) Biopython installed, and periodically
run the unit tests (in particular, test_Emboss.py), that would
be a good start. Right now I know that unit test works with
EMBOSS 4.0.0 and 6.0.1 (which happen to be installed on two
of the machines I use for testing), and mostly works with
EMBOSS 6.1.0 (everything except the GenBank regression
you were just looking into today).

I'm considering extending test_Emboss.py in the future to
take advantage of the new features in EMBOSS 6.1.0
onwards such as GFF and FASTQ support, or perhaps
having a second test script (which will be conditional on
the version of EMBOSS installed).
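
As a rough sketch of what "conditional on the version" might look
like (this is not code from test_Emboss.py; the function names are
made up for illustration, and I am assuming the version string has
already been obtained somehow, e.g. from the embossversion program):

```python
import re


def parse_emboss_version(text):
    """Return the first dotted x.y.z version found in text as a tuple of ints.

    Hypothetical helper: the text is assumed to contain a version number
    such as 6.1.0 somewhere in it.
    """
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError("No version number found in %r" % text)
    return tuple(int(part) for part in match.groups())


def has_gff_and_fastq(version_text):
    """True for EMBOSS 6.1.0 onwards, where GFF and FASTQ support appeared."""
    return parse_emboss_version(version_text) >= (6, 1, 0)
```

A second test script could then skip its GFF and FASTQ tests when
has_gff_and_fastq(...) is false, rather than failing on older
installations.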

>> Peter R. did say it would be simple to turn off this header and
>> footer output, so I thought I would try this myself. It looks like
>> this is handled in file ajax/ajalign.c by function alignWriteMark,
>> but I don't see a switch to disable the headers and footers.
>
> You correctly found how to turn off the header. The footer is
> reported for anything except pure sequence output.
>
> For the next release I will add attributes to the list of alignment
> formats to say whether the header and footer are needed. That
> will allow us better control and reporting.
>
> Meanwhile, we are very happy to standardise the markx* outputs
> to make them easier to parse. Biopython is the first project to
> report problems with this. There are alternatives - specifying
> -aformat and using some other alignment format for all
> applications - but we like to conform and will do our best to fit
> what parsers expect.
>
> Also, of course, once we know we are being parsed we will do
> our best not to let the output change.

This isn't really a problem. Biopython can read EMBOSS's own
alignment formats (pairs and simple), so there is little need for
us to be able to parse EMBOSS's version of the FASTA output.
[Although at the moment we ignore all the header information,
if that formatting will be consistent, we could parse it too.]
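
For what it's worth, ignoring the header and footer only takes a
few lines, since those lines start with a # character. A minimal
sketch (the function name is made up, and the example lines are
placeholders I invented, not real EMBOSS output):

```python
def strip_hash_lines(lines):
    """Yield only those lines which do not start with a # character."""
    for line in lines:
        if not line.startswith("#"):
            yield line


# Invented placeholder text standing in for an EMBOSS output file:
example = [
    "# header line (placeholder)\n",
    "alignment body (placeholder)\n",
    "# footer line (placeholder)\n",
]
body = list(strip_hash_lines(example))
```

This only removes the # lines; it does nothing about the other
differences between the markx10 body and current FASTA -m 10 output.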

However, at least one person wanted to parse EMBOSS
markx10 output strongly enough that he wrote a modified
version of our FASTA -m 10 parser. I would still rather
have EMBOSS revise its output to match FASTA more closely.
See http://bugzilla.open-bio.org/show_bug.cgi?id=2704

Peter C.