[Biopython-dev] [Bug 1944] Align.Generic adding iterator and more

Sun Aug 19 16:01:00 UTC 2007

http://bugzilla.open-bio.org/show_bug.cgi?id=1944

------- Comment #7 from biopython-bugzilla at maubp.freeserve.co.uk  2007-08-19 12:01 EST -------
I thought returning Seq objects rather than strings might be contentious *grin*

From an ideological point of view, returning strings undermines the use of the
Seq object in the first place.

> A Seq is a string with an alphabet attached. I think it is not
> advisable to require that all sequences in an alignment have the
> same alphabet.

We don't have to require this.  The alignment as a whole should have an
alphabet even if it is the lowest common denominator (like the generic single
letter alphabet).

It would be reasonable for the user to create a "generic protein" alignment
where some of the SeqRecords have a more precise alphabet such as IUPACProtein. 

Or, someone might have a "generic nucleotide" alignment where some SeqRecords
are DNA and others RNA (though this would be a bit odd).

> For example, one sequence may be IUPACUnambiguousDNA, another one
> IUPACAmbiguousDNA.

That would be fine. In this case the user should construct their alignment
with any of IUPACAmbiguousDNA, generic DNA, generic nucleotide or even generic
single letter.

> Or, one is IUPACProtein, and an another one the generic Alphabet because
> the user did not explicitly specify the alphabet when creating the Seq object.

In this example, the only sensible choice of alphabet for the whole alignment
would be a generic one.
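
To illustrate the "lowest common denominator" idea from the examples above,
here is a minimal sketch. The classes are simplified stand-ins invented for
this example (they are not the real Bio.Alphabet classes), and
common_alphabet() is a hypothetical helper, not an existing Biopython function.

```python
# Hypothetical stand-in hierarchy, loosely modelled on the alphabets
# discussed in this thread. Not the actual Bio.Alphabet classes.
class Alphabet: pass
class SingleLetterAlphabet(Alphabet): pass
class ProteinAlphabet(SingleLetterAlphabet): pass
class NucleotideAlphabet(SingleLetterAlphabet): pass
class DNAAlphabet(NucleotideAlphabet): pass
class RNAAlphabet(NucleotideAlphabet): pass
class IUPACProtein(ProteinAlphabet): pass
class IUPACAmbiguousDNA(DNAAlphabet): pass
class IUPACUnambiguousDNA(IUPACAmbiguousDNA): pass


def common_alphabet(alphabet_classes):
    """Return the most specific class that all the given classes inherit from."""
    classes = list(alphabet_classes)
    if not classes:
        raise ValueError("need at least one alphabet class")
    # Walk the first class's ancestry from most specific to least specific
    # and stop at the first ancestor shared by every other class.
    for candidate in classes[0].__mro__:
        if all(issubclass(cls, candidate) for cls in classes):
            return candidate


# Unambiguous + ambiguous DNA: the ambiguous alphabet covers both.
print(common_alphabet([IUPACUnambiguousDNA, IUPACAmbiguousDNA]).__name__)
# A precise protein alphabet + an unspecified one: only a generic choice fits.
print(common_alphabet([IUPACProtein, Alphabet]).__name__)
```

The user would still be free to pick something more generic than this for the
alignment as a whole; the helper only finds the most specific alphabet that
is valid for every row.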

> I don't see anything fundamentally wrong with that.

Neither do I. It's nicer (and probably normal) to have all the sequences in the
same alignment with the same alphabet, but not essential.

> So, if we cannot guarantee that all rows in the alignment have the same
> alphabet, then we cannot really return a column of the alignment as a Seq
> -- we won't know the appropriate alphabet.

But we DO know an appropriate alphabet - whatever was specified for the entire
alignment (even if this is the generic single letter alphabet). So in the patch
I used that for any column or part column.

For any given row or part row, we can take the specific alphabet of the
associated SeqRecord (which may be more specific than the alphabet defined for
the whole alignment).
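
The column/row rule described above could be sketched roughly as follows. This
is not the code from the patch: Seq and Alignment here are tiny stand-in
classes, the column() and row() methods are hypothetical names, and alphabets
are represented by plain strings to keep the example short.

```python
class Seq:
    """Stand-in for a sequence: a string with an alphabet attached."""
    def __init__(self, data, alphabet):
        self.data = data
        self.alphabet = alphabet


class Alignment:
    """Stand-in alignment with one alphabet for the whole alignment."""
    def __init__(self, alphabet):
        self.alphabet = alphabet
        self._rows = []

    def add_row(self, seq):
        self._rows.append(seq)

    def column(self, c, r1=None, r2=None):
        # A column (or part column) mixes letters from several rows, so it
        # gets the alphabet declared for the alignment as a whole.
        letters = "".join(row.data[c] for row in self._rows[r1:r2])
        return Seq(letters, self.alphabet)

    def row(self, r, c1=None, c2=None):
        # A row (or part row) comes from a single record, so it can keep
        # that record's own, possibly more specific, alphabet.
        seq = self._rows[r]
        return Seq(seq.data[c1:c2], seq.alphabet)


aln = Alignment("generic protein")
aln.add_row(Seq("MKV-L", "IUPACProtein"))
aln.add_row(Seq("MRVAL", "generic protein"))

col = aln.column(1)
part_row = aln.row(0, 0, 3)
print(col.data, "/", col.alphabet)
print(part_row.data, "/", part_row.alphabet)
```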

> From this viewpoint, align[:,c] or align[r1:r2,c] returning a string seems
> more natural, and then I'd expect align[r,:] or align[r,c1:c2] also to
> return a string.

You haven't convinced me.

Note that at the moment, when an alignment is created "by hand", you must
specify an alphabet (defaulting to the generic single letter alphabet would be
reasonable). The add sequence method currently only takes strings, so all the
SeqRecords will be created with the same alphabet as specified for the whole
alignment.

I think the suggested append() method should accept SeqRecords, provided their
alphabet matches that of the alignment or is a subclass of the alignment's
alphabet.  The current same-alphabet limitation can already be bypassed by
using the SeqIO.to_alignment() function or by assigning SeqRecords directly
to the private alignment._records list.
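
A minimal sketch of the kind of check such an append() method might do. The
alphabet classes, the Record class and this Alignment class are all
hypothetical stand-ins written for illustration, and raising ValueError is
just one possible choice for the failure case.

```python
# Hypothetical stand-in alphabets (not the real Bio.Alphabet classes).
class Alphabet: pass
class ProteinAlphabet(Alphabet): pass
class IUPACProtein(ProteinAlphabet): pass
class DNAAlphabet(Alphabet): pass


class Record:
    """Stand-in for a SeqRecord: sequence text plus an alphabet instance."""
    def __init__(self, seq, alphabet):
        self.seq = seq
        self.alphabet = alphabet


class Alignment:
    def __init__(self, alphabet):
        self.alphabet = alphabet
        self._records = []

    def append(self, record):
        # Accept the record if its alphabet is the alignment's alphabet
        # class or any subclass of it; otherwise refuse.
        if not isinstance(record.alphabet, type(self.alphabet)):
            raise ValueError(
                "record alphabet %s is not compatible with alignment alphabet %s"
                % (type(record.alphabet).__name__, type(self.alphabet).__name__)
            )
        self._records.append(record)


aln = Alignment(ProteinAlphabet())
aln.append(Record("MKVAL", IUPACProtein()))     # more specific: accepted
aln.append(Record("MRVAL", ProteinAlphabet()))  # same alphabet: accepted
try:
    aln.append(Record("ACGTA", DNAAlphabet()))  # unrelated: rejected
except ValueError as err:
    print("rejected:", err)
```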
