[Biopython-dev] MAF Parser/Writer/Indexer

Peter Cock p.j.a.cock at googlemail.com
Sun May 15 20:24:21 UTC 2011


On Sun, May 15, 2011 at 8:59 PM, Sczesnak, Andrew wrote:
>> With something like SAM/BAM (or other assembly formats like ACE or the
>> MIRA alignment format also called MAF), you can have multiple
>> alignments (the contigs or chromosomes) each with many entries
>> (supporting reads). Here there is a clear single reference coordinate
>> system, that of the (gapped) reference contigs/chromosomes. This also
>> means each alignment has a clear name (the name of the reference
>> contig/chromosome), so this name and coordinates can be used for
>> indexing (as in samtools).
>>
>> With MAF however, things are not so easy - any of the sequences could
>> be used as a reference (e.g. human chr 1, or mouse chr 2), and any
>> region of a sequence might be in more than one alignment.
>>
>> I'm beginning to suspect what Andrew has in mind is going to be MAF
>> specific - so it won't be top level functionality in Bio.AlignIO, but
>> rather tucked away in Bio.AlignIO.MafIO instead.
>>
>> Peter
>
> I agree, the fact that this particular format does not explicitly define the
> reference sequence is problematic.  Based on the spec, we ought to be
> prepared for a multiz MAF file with several different reference sequences.
> However, practically speaking, the files out there in the world _do_ have a
> reference sequence, which appears in all alignments and is the first listed
> sequence.

That may be a very useful simplifying assumption. Would you expect
each position on the reference to appear in one and only one alignment
block in the MAF file? Or, might a given region appear in multiple
blocks?
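
One rough way to answer that for any particular file would be to scan it and look for overlaps. The following is only a sketch, under the assumption made above that the first "s" line of each block is the reference. The function names and the tiny example alignment are made up for illustration and are not part of Biopython:

```python
from io import StringIO


def reference_intervals(handle):
    """Yield (src, start, end) for the first 's' line of each MAF block.

    Assumes the first sequence listed in a block is the reference.
    Coordinates are zero-based, half-open, on the forward strand.
    """
    expect_first = False
    for line in handle:
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "a":
            # An 'a' line opens a new alignment block.
            expect_first = True
        elif fields[0] == "s" and expect_first:
            # s <src> <start> <size> <strand> <srcSize> <text>
            src = fields[1]
            start, size = int(fields[2]), int(fields[3])
            strand, src_size = fields[4], int(fields[5])
            if strand == "-":
                # Minus-strand starts count from the reverse-complemented end.
                start = src_size - start - size
            yield src, start, start + size
            expect_first = False


def overlapping_reference_blocks(handle):
    """Return (src, span_a, span_b) for reference spans that overlap."""
    by_src = {}
    for src, start, end in reference_intervals(handle):
        by_src.setdefault(src, []).append((start, end))
    overlaps = []
    for src, spans in by_src.items():
        spans.sort()
        furthest = spans[0]  # span reaching furthest right so far
        for span in spans[1:]:
            if span[0] < furthest[1]:
                overlaps.append((src, furthest, span))
            if span[1] > furthest[1]:
                furthest = span
    return overlaps


# Made-up three-block example; the first two blocks overlap on hg18.chr1.
example = """##maf version=1
a score=10
s hg18.chr1 100 10 + 1000 ACGTACGTAC
s mm9.chr2 500 10 + 2000 ACGTACGTAC

a score=20
s hg18.chr1 105 10 + 1000 ACGTACGTAC
s mm9.chr4 50 10 - 2000 ACGTACGTAC

a score=30
s hg18.chr1 200 5 + 1000 ACGTA
s mm9.chr2 900 5 + 2000 ACGTA
"""

print(overlapping_reference_blocks(StringIO(example)))
```

If that list comes back empty for the real multiz files, then a simple mapping from reference coordinates to file offsets might be enough for indexing. If not, a lookup would need to return several blocks per region.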

> While I think there is definitely some trickiness to how this
> parser will have to interact with any API, my feeling is that these portions
> ought to be confined to MafIO, while a more general API lives in AlignIO or
> elsewhere.
>
> This isn't much different from a format like SFF, I think.
>

What did you mean here? SFF is just another sequence file format as
far as Bio.SeqIO goes; other than being binary, it isn't exceptional.

Peter