[Biopython-dev] Changing Seq equality

Wed Nov 25 13:15:25 UTC 2009

On Wed, Nov 25, 2009 at 12:53 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
> Hi all;
> Interesting discussion on the equality issue.
>
>> Dividing alphabets into those four groups would imply:
>>
>> "ACG" == Seq("ACG") == Seq("ACG", generic_nucleotide)
>> "ACG" != Seq("ACG", generic_rna)
>> "ACG" != Seq("ACG", generic_dna)
>> "ACG" != Seq("ACG", generic_protein)
>> ...
>> Seq("ACG") != Seq("ACG", generic_protein)
>>
>> This has some non-intuitive behaviour. Also it doesn't take
>> into account a number of corner cases (which could be better
>> handled in the existing Seq objects I admit) - things like
>> secondary structure alphabets (e.g. for proteins: coils, beta
>> sheet, alpha helix) or reduced alphabets (e.g. for proteins
>> using Aliphatic/Aromatic/Charged/Tiny/Diverse, or any of
>> the Murphy (2000) tables).
>
> Instead of considering the most horrible edge cases, we should think
> about the most common use cases and make those easy. Alphabets are a
> bit overcomplicated and in practice are probably not being used to
> represent these other potential alphabets. I may be simple minded in
> my programming, but have never seen the benefit of directly encoding
> anything more complicated than DNA, RNA or proteins. The 3 things
> I've used alphabets for are:
>
> - Is it DNA, RNA or protein?
> - Does a sequence match the alphabet? Checking input files.
> - Being careful not to add DNA and protein. In practice, I don't
>  really do this very often.

Me too - but fixing Bug 2597 would really help (either an
exception or a warning would be a big improvement).

>> We could consider a modified version of the string identity
>> approach - make seq1==seq2 act as str(seq1)==str(seq2),
>> but *also* look at the alphabets and if they are incompatible
>> (using the existing rules used in addition etc) raise a Python
>> warning. Right now this seems like quite a tempting idea to
>> explore...
>
> I like this with Jose's cases for the standard DNA, RNA, protein and
> generic alphabets. So provide sequence + alphabet checking for
> all of the common cases, and a warning plus just sequence checking
> for the edge cases. So if you try and compare a DNA sequence and
> your secondary structure alphabet, you will get a mismatch on the
> sequences and a warning about incompatible alphabets.

You seem to be suggesting some hybrid plan here, Brad - I don't
quite follow you. Could you clarify (e.g. with some examples)?

In the meantime, I'll work on a patch to do my suggestion of
hashing and comparison based on string comparison, but with
alphabet aware warnings.
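
Roughly, that idea could be sketched as below. This is a standalone
toy, not the Bio.Seq code: ToySeq, the alphabet constants and
compatible() are hypothetical names made up for illustration, and
the compatibility rule is a simplified assumption, not the existing
rules used in addition.

```python
import warnings

# Hypothetical stand-ins for alphabet kinds (not the Bio.Alphabet classes).
GENERIC, NUCLEOTIDE, DNA, RNA, PROTEIN = (
    "generic", "nucleotide", "dna", "rna", "protein")


def compatible(a, b):
    """Simplified, assumed rule for whether two alphabets can be mixed."""
    if a == b or GENERIC in (a, b):
        return True
    if NUCLEOTIDE in (a, b):
        # Generic nucleotide is treated as compatible with DNA and RNA.
        return {a, b} <= {NUCLEOTIDE, DNA, RNA}
    return False


class ToySeq(object):
    """Toy sequence: equality and hashing follow the string only."""

    def __init__(self, data, alphabet=GENERIC):
        self.data = data
        self.alphabet = alphabet

    def __str__(self):
        return self.data

    def __hash__(self):
        # Hash like the plain string, so it stays consistent with __eq__.
        return hash(str(self))

    def __eq__(self, other):
        # Plain strings carry no alphabet; treat them as generic.
        other_alphabet = getattr(other, "alphabet", GENERIC)
        if not compatible(self.alphabet, other_alphabet):
            warnings.warn(
                "Incompatible alphabets {0} and {1}".format(
                    self.alphabet, other_alphabet),
                stacklevel=2)
        return str(self) == str(other)

    def __ne__(self, other):
        return not self == other
```

With this, ToySeq("ACG", DNA) == ToySeq("ACG", PROTEIN) evaluates
to True but issues a warning, while comparing against the plain
string "ACG" is True with no warning. Because the hash is the string
hash, such objects also behave like their strings as dict keys.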

Peter