[BioPython] Alignment class

Thu Feb 7 14:59:46 UTC 2008

On Feb 7, 2008 2:15 PM, Jan Kosinski <kosa at genesilico.pl> wrote:
> Peter wrote:
> > The whole idea behind the current alignment class is that all the
> > sequences are the same length (often with gaps).
> I was always wondering what is the reason that you made the alignment
> class which requires all sequences have the same length (even if incl.
> gaps)?

The design of the current alignment class predates my involvement, but
from the point of view of the code (and the column access in
particular) it assumes the sequences have the same length.  This
assumption (with leading/trailing gaps) is also common to all the
alignment file formats I have worked with.  I like this abstraction as
you can regard the alignment as an array of characters (using matrix
notation or whatever).
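
To illustrate the "array of characters" view, here is a rough sketch in
plain Python.  It is not the Biopython alignment class itself; the
get_column helper and the example rows are made up for this message.
It only shows why column access relies on every row having the same
length.

```python
# Plain-Python sketch (hypothetical helper, not the Biopython API).
# Each row is one gapped sequence; all rows have the same length.
rows = [
    "--ACGTAC",
    "TTACG-AC",
    "TTACGTA-",
]


def get_column(rows, index):
    """Return column `index` as a string, one character per row."""
    lengths = set(len(row) for row in rows)
    if len(lengths) > 1:
        # Column access only makes sense on a rectangular alignment.
        raise ValueError("all sequences must have the same length")
    return "".join(row[index] for row in rows)


print(get_column(rows, 0))  # -TT
print(get_column(rows, 5))  # T-T
```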

I can see that the EST alignment case is a little different, in that
by convention the leading/trailing "gaps" are not shown.  It would be
possible to write a new EST class which stored the sequences without
leading/trailing "gap"s, but took into account the start offset, and
would allow access to the "columns" inserting leading/trailing gaps
where a given sequence has not started or has already finished.  I
don't see that this would be any more useful (except perhaps for a
small memory saving).
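
A minimal sketch of what such a class might look like, assuming each
row is stored as a start offset plus the sequence without its
leading/trailing gaps.  The class name OffsetAlignment and its methods
are hypothetical; nothing like this exists in Biopython as far as this
thread is concerned.

```python
class OffsetAlignment:
    """Hypothetical container storing (start_offset, sequence) pairs."""

    def __init__(self, gap="-"):
        self.gap = gap
        self.rows = []

    def add(self, start, sequence):
        """Add a sequence whose first character sits in column `start`."""
        if start < 0:
            raise ValueError("start offset must be non-negative")
        self.rows.append((start, sequence))

    def width(self):
        """Number of columns: the furthest end point of any row."""
        return max((start + len(seq) for start, seq in self.rows), default=0)

    def get_column(self, index):
        """Return column `index`, padding with gaps outside each row."""
        if not 0 <= index < self.width():
            raise IndexError("column index out of range")
        chars = []
        for start, seq in self.rows:
            pos = index - start
            if 0 <= pos < len(seq):
                chars.append(seq[pos])
            else:
                # This sequence has not started yet or has already finished.
                chars.append(self.gap)
        return "".join(chars)


ests = OffsetAlignment()
ests.add(0, "ACGTAC")
ests.add(2, "GTACGG")
ests.add(5, "CGGT")

print(ests.width())        # 9
print(ests.get_column(0))  # A--
print(ests.get_column(5))  # CCC
print(ests.get_column(8))  # --T
```

The stored sequences could still contain internal gaps; only the
leading/trailing padding is generated on demand.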

In general leading/trailing gaps can mean the limits of a gene, or the
limit of a domain within a gene, or the limits of a sequenced fragment,
etc.  Sometimes there really is no character to go there; in other
cases the sequence concerned does continue but for whatever reason it
was not included in the alignment.

One possibility (depending on what you want to do with the alignment)
is to use different characters for internal gaps, leading "gaps" and
trailing "gaps".
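
As a rough sketch of that idea in plain Python: the function name
mark_terminal_gaps and the choice of "<" and ">" as marker characters
are arbitrary assumptions for this example, not any file format
convention.

```python
def mark_terminal_gaps(row, gap="-", leading="<", trailing=">"):
    """Replace leading/trailing gap characters, keeping internal gaps.

    A row made up entirely of gaps is treated as all leading gaps.
    """
    without_leading = row.lstrip(gap)
    n_leading = len(row) - len(without_leading)
    core = without_leading.rstrip(gap)
    n_trailing = len(without_leading) - len(core)
    return leading * n_leading + core + trailing * n_trailing


print(mark_terminal_gaps("--AC-GT---"))  # <<AC-GT>>>
```

The row length is unchanged, so column access on the alignment still
works as before.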

Peter