[Biopython-dev] Changing Seq equality

Eric Talevich eric.talevich at gmail.com
Thu Nov 26 07:14:08 UTC 2009


On Wed, Nov 25, 2009 at 6:48 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Wed, Nov 25, 2009 at 11:20 AM, Jose Blanca <jblanca at btc.upv.es> wrote:
>>
>> That's a tricky issue. I think that the desired behaviour should be defined
>> and after that the implementation should go.
>>
>
> Many desired behaviours are mutually contradictory given the way
> Python works, and the current Seq/Alphabet objects. One can come
> up with many possible desired behaviours, but often they are not coherent
> or not technically possible.
>
>> One possible solution would be
>> to consider the generic alphabet different than the more specific ones and
>> consider the str as having a generic alphabet. It would be something like:
>>
>> GenericAlphabet=0, DNAAlphabet=1, RNAAlphabet=2, ProteinAlphabet=3
>> if str:
>>    alphabet=generic
>> else:
>>    alphabet=seq.alphabet
>> return str(seq1) + str(alphabet) == str(seq2) + str(alphabet)
>
> [...]
>
> The whole issue is horribly complicated! Quoting "Zen of Python":
>
> * If the implementation is hard to explain, it's a bad idea.
> * If the implementation is easy to explain, it may be a good idea.
>
> Doing anything complex with alphabets may fall into the "hard
> to explain" category. Using object identity or string identity is
> at least simple to explain.
>
> Thus far we have just two options, and neither is ideal:
> (a) Object identity, following id(seq1)==id(seq2) as now
> (b) String identity, following str(seq1)==str(seq2)

How about (c), string and generic alphabet identity, where
Seq.__hash__ uses the sequence string and some simplification of the
alphabet types, like Jose described? Premise: the sequence string and
alphabet are the only arguments the Seq constructor takes, so if two
objects can both be recreated from the same arguments, they should be
equal as far as sets and dictionaries are concerned. To fall back on
string identity, it's easy enough to map str onto a collection of Seq
objects.
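
As a rough sketch of that fallback: the Seq class below is a hypothetical
stand-in (not the real Bio.Seq.Seq), and the alphabet labels are plain
strings used only for illustration.

```python
class Seq(object):
    """Hypothetical stand-in for Bio.Seq.Seq, for illustration only."""

    def __init__(self, data, alphabet="generic"):
        self._data = data
        self.alphabet = alphabet

    def __str__(self):
        return self._data


seqs = [Seq("ACGT", "dna"), Seq("ACGT", "generic"), Seq("ACGU", "rna")]

# String identity: collapse to plain strings before building the set.
unique_strings = set(map(str, seqs))

# Or keep the first Seq seen for each distinct string.
by_string = {}
for s in seqs:
    by_string.setdefault(str(s), s)
```

Either way the alphabet is ignored, so the DNA and generic "ACGT" objects
collapse into a single entry.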

def __hash__(self):
    """Same string, same alphabet --> same hash."""
    # If alphabet is a standard type, match the generic alphabet types
    if self.alphabet == generic_nucleotide:
        return hash((str(self), Alphabet))
        #OR, to match raw strings: return hash(str(self))
    elif isinstance(self.alphabet, DNAAlphabet):
        return hash((str(self), DNAAlphabet))
    elif isinstance(self.alphabet, RNAAlphabet):
        return hash((str(self), RNAAlphabet))
    elif isinstance(self.alphabet, ProteinAlphabet):
        return hash((str(self), ProteinAlphabet))
    # Other alphabets, maybe user-defined --> require exactly the same type
    else:
        return hash((str(self), self.alphabet.__class__))
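
One caveat: sets and dictionaries only treat two objects as the same key
when they hash alike AND compare equal, so __hash__ would need a matching
__eq__. A minimal self-contained sketch follows, assuming stand-in classes
that only mimic the names in Bio.Alphabet and Bio.Seq (they are not the
real ones), with a hypothetical _key() helper shared by both methods.

```python
# Stand-in alphabet hierarchy; hypothetical, mirrors the Bio.Alphabet names.
class Alphabet(object):
    pass

class DNAAlphabet(Alphabet):
    pass

class RNAAlphabet(Alphabet):
    pass

class ProteinAlphabet(Alphabet):
    pass


class Seq(object):
    """Stand-in for Bio.Seq.Seq, for illustration only."""

    def __init__(self, data, alphabet=None):
        self._data = data
        self.alphabet = alphabet if alphabet is not None else Alphabet()

    def __str__(self):
        return self._data

    def _key(self):
        # Hypothetical helper: reduce the alphabet to its broad type.
        for base in (DNAAlphabet, RNAAlphabet, ProteinAlphabet):
            if isinstance(self.alphabet, base):
                return (self._data, base)
        # Generic or user-defined alphabets: require exactly the same class.
        return (self._data, self.alphabet.__class__)

    def __hash__(self):
        return hash(self._key())

    def __eq__(self, other):
        if not isinstance(other, Seq):
            return NotImplemented
        return self._key() == other._key()


a = Seq("ACGT", DNAAlphabet())
b = Seq("ACGT", DNAAlphabet())
c = Seq("ACGT", RNAAlphabet())
d = Seq("ACGT")  # generic alphabet
unique = set([a, b, c, d])
```

Here a and b collapse into one set member, while c and d stay distinct
because their alphabets reduce to different types.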


Cheers,
Eric



