[emboss-dev] EMBOSS and its FASTA like alignment output

Peter biopython at maubp.freeserve.co.uk
Mon Aug 3 17:12:09 UTC 2009


On Mon, Aug 3, 2009 at 4:31 PM, Peter Rice<pmr at ebi.ac.uk> wrote:
>
> Peter wrote:
>> Hi,
>>
>> One of the many things I talked to Peter Rice about in Sweden
>> was the Pearson FASTA like output from needle and water (e.g.
>> what EMBOSS calls the markx10 output format), and why it
>> includes the EMBOSS header and footer lines (which start with
>> a # character), which are not present in real FASTA output.
>>
>> Biopython can parse the pairwise -m 10 output from Bill
>> Pearson's FASTA tools, so in theory we (Biopython) should
>> be able to parse the markx10 output from EMBOSS needle
>> and water. We could probably cope with the extra header
>> and footer, but I think it would be best if EMBOSS could
>> produce something more closely matching the real FASTA
>> output. Unfortunately, it appears to be more than just the
>> headers which upset our parser - even ignoring them,
>> EMBOSS markx10 output still looks rather different to
>> (current) FASTA -m 10 output. Was the markx10 output
>> mimicking a particular (old) version of the FASTA tools?
>
> I have checked the latest FASTA3 and FASTA2 tools from
> Bill Pearson.
>
> What does BioPython expect as "markx10" and the other
> markx formats?

We only support the "-m 10" output format from the FASTA tools,
which is intended to be machine readable, i.e. what EMBOSS
tries to mimic with "markx10". So I am not worried about the
other markx formats that EMBOSS can produce.

> There are extra lines reporting equivalent data to the EMBOSS alignment
> headers which we could include, but I would like to know there is a
> parser that can accept them as markx* format in each case.
>
> In this case "more closely matching" may not be close enough :-)

One thing that looked "wrong" by eye in the EMBOSS markx10
output is the ">" lines. In particular, I expect to see lines
starting "  1>>>identifier", "  2>>>identifier", ... to mark the start
of the result set for each query. EMBOSS doesn't output these.

In the case of needle and water as things stand, you only ever
have one query sequence (although we have discussed a
"superneedle" and "superwater" as possible enhancements),
so there would only be one such line.
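
As a rough illustration of the query-start lines described above, here
is a minimal Python sketch that scans output for them. It is not
Biopython's parser code; the function name find_query_starts and the
regular expression are assumptions made for this example, based only on
the "  1>>>identifier" shape mentioned above.

```python
import re

# Assumed shape of a query-start line: optional leading spaces,
# a query number, then ">>>" followed by the identifier.
QUERY_START = re.compile(r"^\s*(\d+)>>>(\S+)")


def find_query_starts(lines):
    """Return (number, identifier) for each line that looks like a query start."""
    starts = []
    for line in lines:
        match = QUERY_START.match(line)
        if match:
            starts.append((int(match.group(1)), match.group(2)))
    return starts
```

Run over current EMBOSS markx10 output, a check like this would be
expected to return an empty list, since those lines are not written.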

Beyond that, I'd have to dig a little deeper into our code, feeding
it EMBOSS markx10 output with the header/footer removed, and
see where it falls over. Things like the histogram are optional and
we ignore them anyway. I am happy to test patches (off the list
if you prefer).
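
For anyone wanting to repeat that experiment, removing the EMBOSS
header and footer could be sketched as below. This assumes that every
line starting with "#" belongs to the header or footer and that no
alignment line starts with "#"; the function name strip_hash_lines is
hypothetical.

```python
def strip_hash_lines(lines):
    """Drop lines starting with '#', i.e. the EMBOSS header and footer lines.

    Assumption: no line in the alignment body itself starts with '#'.
    """
    return [line for line in lines if not line.startswith("#")]
```

The remaining lines can then be written to a temporary file or wrapped
in a file-like object before being handed to a FASTA -m 10 parser.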

(Although I would prioritise the FASTQ stuff.)

Regards,

Peter C.


