[Biopython-dev] Changing Seq equality

Peter biopython at maubp.freeserve.co.uk
Thu Nov 26 10:41:10 UTC 2009


On Thu, Nov 26, 2009 at 7:14 AM, Eric Talevich <eric.talevich at gmail.com> wrote:
>
> On Wed, Nov 25, 2009 at 6:48 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>> Doing anything complex with alphabets may fall into the "hard
>> to explain" category. Using object identity or string identity is
>> at least simple to explain.
>>
>> Thus far we have just two options, and neither is ideal:
>> (a) Object identity, following id(seq1)==id(seq2) as now
>> (b) String identity, following str(seq1)==str(seq2)
>
> How about (c), string and generic alphabet identity, where
> Seq.__hash__ uses the sequence string and some simplification of the
> alphabet types like Jose described. Premise: the sequence string and
> alphabet are the only arguments the Seq constructor takes, so if two
> objects can both be recreated from the same arguments, they should be
> equal as far as sets and dictionaries are concerned. To fall back on
> string identity, it's easy enough to map str onto a collection of Seq
> objects.
>
> def __hash__(self):
>    """Same string, same alphabet --> same hash."""
>    # If alphabet is a standard type, match the generic alphabet types
>    if self.alphabet == generic_nucleotide:
>        return hash((str(self), Alphabet))
>        #OR, to match raw strings: return hash(str(self))
>    elif isinstance(self.alphabet, DNAAlphabet):
>        return hash((str(self), DNAAlphabet))
>    elif isinstance(self.alphabet, RNAAlphabet):
>        return hash((str(self), RNAAlphabet))
>    elif isinstance(self.alphabet, ProteinAlphabet):
>        return hash((str(self), ProteinAlphabet))
>    # Other alphabets, maybe user-defined --> require exactly the same type
>    else:
>        return hash((str(self), self.alphabet.__class__))

As an aside, you'd need to get the base alphabet (i.e. remove any
AlphabetEncoder wrappers) to decide if it is RNA/DNA/Protein.
There is a private helper function in Bio.Alphabet for this. I don't
think these AlphabetEncoder objects (like Gapped) were an
entirely sensible design... but it's done now.
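
To illustrate the unwrapping step, here is a minimal sketch. The classes
below are simplified stand-ins written for this example, not the real
Bio.Alphabet classes, and base_alphabet is a hypothetical name rather than
the private helper mentioned above. It assumes each encoder keeps the
alphabet it wraps in an "alphabet" attribute.

```python
# Stand-in classes for illustration only (NOT the real Bio.Alphabet code).
class Alphabet(object):
    pass


class DNAAlphabet(Alphabet):
    pass


class AlphabetEncoder(object):
    """Wrapper holding another alphabet (assumed attribute: .alphabet)."""

    def __init__(self, alphabet):
        self.alphabet = alphabet


class Gapped(AlphabetEncoder):
    pass


def base_alphabet(alphabet):
    """Peel off any encoder wrappers and return the innermost alphabet."""
    while isinstance(alphabet, AlphabetEncoder):
        alphabet = alphabet.alphabet
    return alphabet


wrapped = Gapped(Gapped(DNAAlphabet()))
print(isinstance(wrapped, DNAAlphabet))                 # False
print(isinstance(base_alphabet(wrapped), DNAAlphabet))  # True
```

Without the unwrapping, an isinstance() check against DNAAlphabet in
__hash__ would miss a gapped DNA alphabet and fall through to the final
"else" branch.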

This idea (c) has a major drawback for me, in that it appears you
wouldn't support comparing Seq objects to strings. However,
perhaps that is actually a good thing - that could raise a TypeError,
to force the user to do str(my_seq) == "ACG" which is explicit.
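
A rough sketch of what that stricter behaviour could look like, assuming
a toy class (SketchSeq is a made-up name, and the alphabet is reduced to
a plain label for brevity). Equality and hashing use the same
(string, alphabet) key, and comparing against anything that is not a
SketchSeq raises TypeError.

```python
class SketchSeq(object):
    """Toy sequence class; hypothetical, not the real Bio.Seq.Seq."""

    def __init__(self, data, alphabet_label=None):
        self._data = data
        self.alphabet_label = alphabet_label  # stand-in for an alphabet type

    def __str__(self):
        return self._data

    def _key(self):
        # The same pair drives both __eq__ and __hash__, so equal
        # objects always hash equally.
        return (self._data, self.alphabet_label)

    def __hash__(self):
        return hash(self._key())

    def __eq__(self, other):
        if not isinstance(other, SketchSeq):
            # Force the explicit form: str(my_seq) == "ACG"
            raise TypeError("compare str(seq) with a string explicitly")
        return self._key() == other._key()


a = SketchSeq("ACG", "DNA")
b = SketchSeq("ACG", "DNA")
print(a == b)            # True
print(len({a, b}))       # 1
print(str(a) == "ACG")   # True
```

Note that the usual Python convention is for __eq__ to return
NotImplemented for unsupported types, in which case == quietly evaluates
to False. Raising TypeError is the stricter variant: both a == "ACG" and
"ACG" == a fail loudly.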

As I understood his proposal, in Jose's related idea (which didn't
get assigned a letter yet), "ACG"==Seq("ACG") would hold for the
default generic alphabet, but not for RNA/DNA/Protein, e.g.
"ACG"!=Seq("ACG",generic_dna), which I find very
counterintuitive.
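
To make the behaviour concrete, here is a small sketch of that reading
of the idea. It is only my interpretation, not Jose's actual code:
JoseStyleSeq and the GENERIC label are invented for the example, and
the rule that two sequence objects must carry the same alphabet label
to be equal is an assumption.

```python
GENERIC = "generic"  # hypothetical label for the default generic alphabet


class JoseStyleSeq(object):
    """Toy class: equals a plain string only under the generic alphabet."""

    def __init__(self, data, alphabet_label=GENERIC):
        self._data = data
        self.alphabet_label = alphabet_label

    def __eq__(self, other):
        if isinstance(other, str):
            return self.alphabet_label == GENERIC and self._data == other
        if isinstance(other, JoseStyleSeq):
            # Assumption: sequence-to-sequence equality needs matching labels.
            return (self._data, self.alphabet_label) == (
                other._data, other.alphabet_label)
        return NotImplemented

    def __hash__(self):
        # Must match hash(str) whenever the object can equal a string.
        if self.alphabet_label == GENERIC:
            return hash(self._data)
        return hash((self._data, self.alphabet_label))


print("ACG" == JoseStyleSeq("ACG"))         # True
print("ACG" == JoseStyleSeq("ACG", "dna"))  # False
```

Under these assumptions a set holding "ACG" and JoseStyleSeq("ACG") has
one element, while swapping in JoseStyleSeq("ACG", "dna") gives two,
even though both objects print as the same letters.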

Peter



