[Biopython-dev] Alignment columns as strings or Seq objects?

Eric Talevich eric.talevich at gmail.com
Fri May 14 02:28:13 UTC 2010


Here's another +1 for plain strings. I agree with Michiel, and if the user
really needs to rebuild a Seq with the original alphabet, it's not too
difficult to fetch that information from the original alignment object.

-Eric

On Thu, May 13, 2010 at 5:29 PM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:

> I would definitely use a plain string. A Seq object suggests that we're
> dealing with a real biological sequence, which a column in the alignment
> matrix is not. The only advantage of having a Seq object is that it has an
> alphabet associated with it. But alphabets are very rarely used in practice,
> if at all. Reverse complementing or (back-)transcribing are available in the
> Bio.Seq module as functions that can operate on plain strings, so we don't
> need a Seq object for that.
>
> --Michiel.
>
> --- On Thu, 5/13/10, Peter <biopython at maubp.freeserve.co.uk> wrote:
>
> > From: Peter <biopython at maubp.freeserve.co.uk>
> > Subject: [Biopython-dev] Alignment columns as strings or Seq objects?
> > To: "Biopython-Dev Mailing List" <biopython-dev at biopython.org>
> > Date: Thursday, May 13, 2010, 7:47 AM
> > Peter wrote:
> > > Hello all,
> > >
> > > Are there any outstanding issues we should address before making
> > > the Biopython 1.54 release?
> > >
> > > ...
> > >
> > > One thing I am wondering about is making column extraction in
> > > the new alignment object return a string rather than a Seq object.
> > > I'll start another thread on this issue...
> >
> > I remember we debated this a bit before but can't find the
> > thread right now. See also Bug 3066 where I am proposing
> > to add methods to iterate over the rows or columns as
> > strings.
> > http://bugzilla.open-bio.org/show_bug.cgi?id=3066
> >
> > The main benefit of using a plain string when extracting the
> > alignment columns is speed. Because the data is stored by
> > row, each time we extract a column we would have to build
> > a new instance of the Seq object. For large alignments (and
> > thinking ahead to next-gen alignment objects) this could be
> > a painful overhead.
> >
> > Because the whole alignment has an alphabet, we can use this
> > to assign an alphabet to a column sequence. Note that the rows
> > of the alignments could have slightly different alphabets. So it
> > is possible (and the current code does this) to generate a Seq
> > object with a meaningful alphabet from a column.
> >
> > Why is this useful? Other than the alphabet, the main benefit
> > of using a Seq object is consistency. On a practical level, the
> > Seq object's biological translate method isn't appropriate at all
> > for an alignment column. On the other hand, one might possibly
> > want to use (back)transcribe to flip between DNA and RNA,
> > and maybe even take the complement.
> >
> > Are there any strong views here on how alignment slicing to
> > get a column should behave? i.e. should align[:,9] return the
> > column as a string or as a Seq?
> >
> > Peter
> > _______________________________________________
> > Biopython-dev mailing list
> > Biopython-dev at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biopython-dev
> >
>
>
>
>
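
To make the trade-off discussed in this thread concrete, here is a minimal
sketch in plain Python. It is not Biopython's implementation: it assumes the
alignment is held as a list of equal-length row strings, and the helper names
get_column and iter_columns are hypothetical. Because the data is stored by
row, each column has to be assembled from one character per row, which is why
returning a plain string is cheaper than wrapping every column in a new object.

```python
def get_column(rows, index):
    """Return alignment column `index` as a plain string.

    `rows` is assumed to be a list of equal-length strings, one per
    aligned sequence (an assumption made for this sketch only).
    """
    return "".join(row[index] for row in rows)


def iter_columns(rows):
    """Yield each alignment column as a plain string, left to right."""
    # zip(*rows) transposes the row-wise storage into column tuples.
    for column in zip(*rows):
        yield "".join(column)


# Toy three-row alignment; "-" marks a gap.
rows = ["ACGT-A", "AC-TTA", "ACGTTA"]

third = get_column(rows, 2)          # one column, like align[:, 2]
columns = list(iter_columns(rows))   # every column in order
```

If the caller does need a richer object for a particular column, it can be
built on demand from the returned string, so the common case stays fast.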


