[Biopython-dev] MAF Parser/Writer/Indexer

Peter Cock p.j.a.cock at googlemail.com
Sat May 14 11:30:07 UTC 2011


Hi Andrew,

I've had a look at those example files you linked to now.

On Fri, May 13, 2011 at 10:26 PM, Andrew Sczesnak wrote:
> Hi All,
> The value of this format to most users will come from the ability to extract
> sequences from an arbitrary number of species that align to a particular
> sequence range in a particular genome, at random.  We should be able to
> say, report the alignment of 50 genomes to the human HOX locus fairly
> quickly (say <1s).  An iterator and writer class will certainly be useful,
> but to implement the aforementioned functionality, some API changes are
> probably necessary.

I had previously considered a cross-format Bio.AlignIO index on
alignment number (i.e. 0, 1, 2, ... n-1 if the file contains n
alignments). That would work on PHYLIP, Stockholm, Clustalw, etc, even
FASTA if your alignments all have the same number of entries. It could
also be used with MAF. However, I don't think it is useful. Of the
current file formats supported in AlignIO, in my experience only
PHYLIP files regularly contain more than one alignment, and since
these are used for bootstrapping, random access is not required
(iteration is enough). And presumably for MAF, there is no reason to
want to access the alignments by this index number either.
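
To make the alignment-number idea concrete, here is a rough sketch of
what such an index could look like for MAF. It is not Bio.AlignIO
code: the function names and the tiny example file are made up for
illustration, and it assumes each alignment block starts with an "a"
line and ends at a blank line.

```python
import io

# Hypothetical two-block MAF file, used only for this illustration.
EXAMPLE_MAF = b"""##maf version=1
a score=10
s hg19.chr1 100 5 + 1000 ACGTA
s mm9.chr2 200 5 + 2000 ACGTT

a score=20
s hg19.chr1 103 4 + 1000 TACC
s mm9.chr4 50 4 - 3000 TACG

"""


def index_blocks_by_number(handle):
    """Return the byte offset of every 'a' line, in file order."""
    offsets = []
    while True:
        pos = handle.tell()
        line = handle.readline()
        if not line:
            break  # end of file
        if line.split()[:1] == [b"a"]:
            offsets.append(pos)
    return offsets


def read_block(handle, offsets, n):
    """Seek to alignment number n and return its lines as text."""
    handle.seek(offsets[n])
    lines = []
    for raw in iter(handle.readline, b""):
        if not raw.strip():
            break  # a blank line closes the block
        lines.append(raw.decode("ascii").rstrip("\r\n"))
    return lines


handle = io.BytesIO(EXAMPLE_MAF)
offsets = index_blocks_by_number(handle)
second_block = read_block(handle, offsets, 1)
```

The lookup key here is only the position of the block in the file,
which says nothing about which species or coordinates it covers.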

With something like SAM/BAM (or other assembly formats like ACE or the
MIRA alignment format also called MAF), you can have multiple
alignments (the contigs or chromosomes) each with many entries
(supporting reads). Here there is a clear single reference coordinate
system, that of the (gapped) reference contigs/chromosomes. This also
means each alignment has a clear name (the name of the reference
contig/chromosome), so this name and coordinates can be used for
indexing (as in samtools).

With MAF however, things are not so easy - any of the sequences could
be used as a reference (e.g. human chr 1, or mouse chr 2), and any
region of a sequence might be in more than one alignment.
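
As a rough illustration of why this is harder, here is a minimal
sketch of a per-sequence coordinate lookup. It is only a sketch under
stated assumptions, not a proposed API: the function names and example
data are hypothetical, it uses a plain linear scan rather than a real
on-disk index, and it assumes the usual MAF "s" line layout (src,
start, size, strand, srcSize, text) with minus-strand starts counted
from the end of the source sequence.

```python
from collections import defaultdict

# Hypothetical two-block MAF text, used only for this illustration.
EXAMPLE_MAF = """##maf version=1
a score=10
s hg19.chr1 100 5 + 1000 ACGTA
s mm9.chr2 200 5 + 2000 ACGTT

a score=20
s hg19.chr1 103 4 + 1000 TACC
s mm9.chr4 50 4 - 3000 TACG
"""


def build_interval_index(maf_text):
    """Map each source name to (start, end, block_number) tuples.

    Coordinates are zero-based, half-open, on the forward strand.
    """
    index = defaultdict(list)
    block = -1
    for line in maf_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "a":
            block += 1  # a new alignment block begins
        elif fields[0] == "s":
            src, strand = fields[1], fields[4]
            start, size, src_size = int(fields[2]), int(fields[3]), int(fields[5])
            if strand == "-":
                # Convert a minus-strand start to forward-strand coordinates.
                start = src_size - start - size
            index[src].append((start, start + size, block))
    return index


def blocks_overlapping(index, src, start, end):
    """Return the sorted block numbers whose src interval overlaps [start, end)."""
    return sorted({b for s, e, b in index.get(src, []) if s < end and start < e})


index = build_interval_index(EXAMPLE_MAF)
hits = blocks_overlapping(index, "hg19.chr1", 103, 105)
```

In this toy file the query on hg19.chr1 hits both blocks, and every
sequence name in the file is a possible key, so there is no single
reference coordinate system to index on.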

I'm beginning to suspect what Andrew has in mind is going to be MAF
specific - so it won't be top level functionality in Bio.AlignIO, but
rather tucked away in Bio.AlignIO.MafIO instead.

Peter



