[Biopython-dev] Alphabet bug in Bio.Motif and Bio.motifs

Wed Jun 5 10:29:58 UTC 2013

On Wed, Jun 5, 2013 at 11:12 AM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
>> I wouldn't want to subclass sets due to the fact that in many
>> existing uses of the alphabets the order of the letters is
>> important (and this is not specified in a Python set).
>
> OK, then indeed a set wouldn't be appropriate.
>
>> But I agree that a rationalised alphabet system like that could
>> work better. Here equality testing could be on both being the
>> same type, e.g. DNA, and having the same letters - including
>> special letters for gaps or stop codons (which are the nastiest
>> part of the current alphabet object system)?
>
> I guess that it depends on how the alphabet is used. For example, for
> the example in the bug report the order of the letters doesn't matter,
> but for other cases it may matter.

What is the motif class doing that restricts it to IUPAC
unambiguous DNA, rather than any DNA alphabet, such as
ambiguous DNA or mixed case sequences?

> Personally I almost never use
> alphabets. Can anybody give some real-life examples of how they
> are used?

The generic aim is to label Seq objects as either DNA, RNA or
protein (and restrict operations like additions or translation
accordingly). That doesn't need the letter level information.
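
As a rough illustration of that first use, here is a minimal sketch
in plain Python. LabelledSeq and its "kind" attribute are made up
for this email and are not Biopython's Seq or alphabet API; the point
is only that the type label alone is enough to restrict addition.

```python
class LabelledSeq:
    """Hypothetical stand-in for a Seq object (not Biopython's class)."""

    def __init__(self, data, kind):
        # Only a coarse type label is stored, no letter-level alphabet.
        if kind not in ("DNA", "RNA", "protein"):
            raise ValueError("unknown sequence type: %r" % kind)
        self.data = data
        self.kind = kind

    def __add__(self, other):
        if not isinstance(other, LabelledSeq):
            return NotImplemented
        # Refuse to concatenate sequences of different types.
        if self.kind != other.kind:
            raise TypeError("cannot add %s to %s" % (other.kind, self.kind))
        return LabelledSeq(self.data + other.data, self.kind)


combined = LabelledSeq("ACGT", "DNA") + LabelledSeq("TT", "DNA")
```

Adding a "protein" LabelledSeq to a "DNA" one raises TypeError in
this sketch.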

Validating that sequences use the expected letters only (e.g. if
sending to a tool which does not understand U (selenocysteine)
as an amino acid, or if writing to a restricted file format). I
think the NEXUS code has this kind of constraint.
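
A validation check of that kind can be sketched in plain Python as
below. The function name and the choice of the 20 standard amino
acid letters as the allowed set are assumptions for illustration;
this is not what the NEXUS code or any Biopython module actually
does.

```python
# The 20 standard amino acid one-letter codes (no U, no ambiguity codes).
STANDARD_AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def unexpected_letters(sequence, allowed=STANDARD_AMINO_ACIDS):
    """Return a sorted list of letters in sequence that are not allowed."""
    return sorted(set(sequence) - set(allowed))


bad = unexpected_letters("MKUVL")
```

Here the order of the letters is irrelevant, so a set is fine for
the comparison itself.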

Counting amino acid or nucleotide frequencies - even if your
example proteins happen to lack proline, you'd probably want
to consider it in your list of amino acids. Depending on your
data structure that could be important (and a consistent letter
order may matter too, e.g. for array indexing).
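
To make the counting case concrete, here is a small sketch, again
in plain Python with made-up names rather than any Biopython API.
The counts are taken over a fixed, ordered list of letters, so a
letter missing from the data still gets an entry, and the position
of each count follows the order of the letters.

```python
# Fixed, ordered list of letters; the order defines the list indices.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def letter_counts(sequence, letters=AMINO_ACIDS):
    """Return one count per letter, in the same order as letters."""
    return [sequence.count(letter) for letter in letters]


counts = letter_counts("MKVLAAGW")
# Proline is absent from the example but still has a (zero) entry.
proline_count = counts[AMINO_ACIDS.index("P")]
```

With a set instead of an ordered string, the index of each count
would not be well defined.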

Peter