[Biopython-dev] Changing Seq equality

Peter biopython at maubp.freeserve.co.uk
Wed Nov 25 11:48:16 UTC 2009


On Wed, Nov 25, 2009 at 11:20 AM, Jose Blanca <jblanca at btc.upv.es> wrote:
>
> That's a tricky issue. I think that the desired behaviour should be defined
> and after that the implementation should go.
>

Many desired behaviours are mutually contradictory given the way
Python works and the current Seq/Alphabet objects. One can come
up with many possible desired behaviours, but often they are not
coherent or not technically possible.

> One possible solution would be
> to consider the generic alphabet different than the more specific ones and
> consider the str as having a generic alphabet. It would be something like:
>
> GenericAlphabet=0, DNAAlphabet=1, RNAAlphabet=2, ProteinAlphabet=3
> if str:
>    alphabet=generic
> else:
>    alphabet=seq.alphabet
> return str(seq1) + str(alphabet) == str(seq2) + str(alphabet)

Dividing alphabets into those four groups would imply:

"ACG" == Seq("ACG") == Seq("ACG", generic_nucleotide)
"ACG" != Seq("ACG", generic_rna)
"ACG" != Seq("ACG", generic_dna)
"ACG" != Seq("ACG", generic_protein)
...
Seq("ACG") != Seq("ACG", generic_protein)

This has some non-intuitive behaviour. Also it doesn't take
into account a number of corner cases (which, I admit, could be
better handled in the existing Seq objects) - things like
secondary structure alphabets (e.g. for proteins: coils, beta
sheet, alpha helix) or reduced alphabets (e.g. for proteins
using Aliphatic/Aromatic/Charged/Tiny/Diverse, or any of
the Murphy (2000) tables).
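
To make the four-group rule concrete, here is a rough sketch in plain
Python. ToySeq and the GENERIC/DNA/RNA/PROTEIN constants are made up for
illustration; they are not the Biopython Seq or Alphabet classes, and
the sketch only models the four groups from the quoted proposal.

```python
# Hypothetical stand-ins for illustration; not Biopython's Seq/Alphabet.
GENERIC, DNA, RNA, PROTEIN = range(4)


class ToySeq:
    def __init__(self, data, group=GENERIC):
        self.data = data
        self.group = group

    def __eq__(self, other):
        # A plain str is treated as if it carried the generic group.
        if isinstance(other, str):
            other = ToySeq(other)
        if not isinstance(other, ToySeq):
            return NotImplemented
        # Equal only if both the letters and the group match.
        return (self.data, self.group) == (other.data, other.group)


# Same letters, but only the generic case compares equal to the string.
print("ACG" == ToySeq("ACG"))
print("ACG" != ToySeq("ACG", DNA))
print(ToySeq("ACG") != ToySeq("ACG", PROTEIN))
```

Under these assumptions all three lines print True, which is the
behaviour listed above: the same letters stop being equal as soon as
one side has a specific group.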

The whole issue is horribly complicated! Quoting "Zen of Python":

* If the implementation is hard to explain, it's a bad idea.
* If the implementation is easy to explain, it may be a good idea.

Doing anything complex with alphabets may fall into the "hard
to explain" category. Using object identity or string identity is
at least simple to explain.

Thus far we have just two options, and neither is ideal:
(a) Object identity, following id(seq1)==id(seq2) as now
(b) String identity, following str(seq1)==str(seq2)
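
As a minimal sketch of the difference between (a) and (b), using two
made-up classes (IdentitySeq and StringSeq are illustrative names only,
not anything in Biopython):

```python
class IdentitySeq:
    """Option (a): no __eq__ defined, so == falls back to object identity."""

    def __init__(self, data):
        self.data = data

    def __str__(self):
        return self.data


class StringSeq(IdentitySeq):
    """Option (b): == compares the string content and nothing else."""

    def __eq__(self, other):
        return str(self) == str(other)

    def __hash__(self):
        # Keep hashing consistent with the string-based equality.
        return hash(str(self))


a1, a2 = IdentitySeq("ACG"), IdentitySeq("ACG")
print(a1 == a2)  # False: two distinct objects
print(a1 == a1)  # True: same object

b1, b2 = StringSeq("ACG"), StringSeq("ACG")
print(b1 == b2)     # True: same letters
print(b1 == "ACG")  # True: also equal to a plain string
```

Note that with (b), defining __eq__ means __hash__ has to be defined too
if the objects are to stay usable as dictionary keys or in sets.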

We could consider a modified version of the string identity
approach - make seq1==seq2 act as str(seq1)==str(seq2),
but *also* look at the alphabets and, if they are incompatible
(using the existing rules applied for addition etc.), raise a
Python warning. Right now this seems like quite a tempting idea
to explore...
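
Roughly what that might look like, as a sketch only. WarnSeq, the
alphabet tags and the compatible() rule are all assumptions made up for
this example; the real check would reuse whatever the Seq object
already does for addition.

```python
import warnings

# Hypothetical alphabet tags, for illustration only.
GENERIC, DNA, RNA, PROTEIN = "generic", "dna", "rna", "protein"


def compatible(a, b):
    # Assumed rule: generic is compatible with anything,
    # otherwise the two tags must match.
    return a == b or GENERIC in (a, b)


class WarnSeq:
    def __init__(self, data, alphabet=GENERIC):
        self.data = data
        self.alphabet = alphabet

    def __str__(self):
        return self.data

    def __eq__(self, other):
        # Plain strings (or anything without an alphabet) count as generic.
        other_alphabet = getattr(other, "alphabet", GENERIC)
        if not compatible(self.alphabet, other_alphabet):
            warnings.warn(
                "Comparing sequences with incompatible alphabets",
                stacklevel=2,
            )
        # The result itself depends only on the string content.
        return str(self) == str(other)

    def __hash__(self):
        return hash(str(self))


# Compares equal on content, but emits a warning about DNA vs protein.
print(WarnSeq("ACG", DNA) == WarnSeq("ACG", PROTEIN))
```

The comparison still returns True here; the warning is the only sign
that the alphabets disagree, so the equality rule stays easy to explain.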

Peter



