[Biopython-dev] Changing Seq equality

Tue Nov 24 11:30:04 UTC 2009

Dear all,

One thing about the Seq object that still annoys me, and is
rather confusing for novices, is the equality testing. It would
be nice to "fix" this, but it turns out to be quite complicated
due to the way Python works. Brad and I started talking
about this a few months ago at BOSC2009, but ran out of
time.

First, a brief aside about hashes (used in dictionaries and
sets). In Python immutable objects can be hashed, via
the hash function or a custom __hash__ method. An
important detail is that if two objects evaluate as equal,
they must have the same hash (otherwise dictionaries
break and other bad things happen). The reverse is not
strictly required, as unequal objects may share a hash. e.g.

>>> hash(1)
1
>>> hash(1.0)
1
>>> hash("1")
1977051568
>>> "1"==1
False
>>> 1.0==1
True
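
As a small illustration of why this matters, here is a sketch
using only built-in types. Because 1 and 1.0 are equal and share
a hash, a dictionary treats them as the same key, while the
string "1" is a separate key:

```python
# Equal objects with equal hashes act as one dictionary key.
d = {1: "int one"}
d[1.0] = "float one"  # 1.0 == 1 and hash(1.0) == hash(1), so this overwrites

assert len(d) == 1
assert d[1] == "float one"

# The string "1" is not equal to 1, so it becomes a second key.
d["1"] = "string one"
assert len(d) == 2
```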

See also:
http://mail.python.org/pipermail/python-dev/2002-December/031455.html

In Biopython, the Seq object is immutable (read only) and
can be used as a dictionary key. However, we don't
implement equality or hashes explicitly, thus get the object
default behaviour. This means two Seq objects are only
equal if they are the same object in memory. The hash
is actually the address in memory:

>>> from Bio.Seq import Seq
>>> s = Seq("ACGT")
>>> id(s)
532624
>>> hash(s)
532624

This means that while a Seq can be used as a dictionary
key, the test is for object equality - which is of limited use.
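
The same default behaviour can be seen without Biopython. The
sketch below uses a hypothetical stand-in class, PlainSeq, which
(like the Seq object described above) defines neither __eq__ nor
__hash__. The class name and attribute are assumptions made
purely for illustration:

```python
class PlainSeq(object):
    """Hypothetical stand-in with no __eq__ or __hash__ defined."""

    def __init__(self, data):
        self.data = data


a = PlainSeq("ACGT")
b = PlainSeq("ACGT")

same_object = (a == a)   # True, same object in memory
same_letters = (a == b)  # False, different objects despite equal data

lookup = {a: "found"}
hit_with_a = lookup.get(a)  # "found"
hit_with_b = lookup.get(b)  # None, since b is a different key
```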

Now, the MutableSeq has an "alphabet aware" equality
defined. Because these are mutable objects, they don't
have a hash, and cannot be used as dictionary keys.
This means there are no hash related restrictions on
the equality rules. Now, what if the Seq object had a
similar "alphabet aware" equality?

The problem is if we'd like Seq("ACGT") to be equal to
Seq("ACGT", generic_dna) then both must have the
same hash. Then, if we also want Seq("ACGT") and
Seq("ACGT", generic_protein) to be equal, they too must
have the same hash. This means Seq("ACGT", generic_dna)
and Seq("ACGT",generic_protein) would have the same
hash, and since dictionaries and sets rely on equality
being transitive, they ought to evaluate as equal too (!).
The natural consequence of this chain of logic is we would
then have Seq("ACGT") == Seq("ACGT", generic_dna)
== Seq("ACGT",generic_protein) == Seq("ACGT",...).
You reach the same point if we require the string
"ACGT" equals Seq("ACGT", some_alphabet).

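To make the problem concrete, here is a rough sketch of what an
"alphabet aware" equality with a hash might look like. ToySeq is
a hypothetical class, not Biopython code, and the alphabets are
plain strings (with None standing in for a generic alphabet).
The hash has to ignore the alphabet, and equality ends up not
being transitive:

```python
class ToySeq(object):
    """Hypothetical sequence with 'alphabet aware' equality."""

    def __init__(self, data, alphabet=None):
        self.data = data
        self.alphabet = alphabet  # None stands in for a generic alphabet

    def __eq__(self, other):
        if not isinstance(other, ToySeq):
            return NotImplemented
        if self.data != other.data:
            return False
        # A generic alphabet matches anything; otherwise they must agree.
        return (self.alphabet is None or other.alphabet is None
                or self.alphabet == other.alphabet)

    def __hash__(self):
        # Equal objects must share a hash, so the alphabet cannot take part.
        return hash(self.data)


generic = ToySeq("ACGT")
dna = ToySeq("ACGT", "dna")
protein = ToySeq("ACGT", "protein")

generic_eq_dna = (generic == dna)          # True
generic_eq_protein = (generic == protein)  # True
dna_eq_protein = (dna == protein)          # False, so not transitive

# The size of a set now depends on the insertion order.
size_generic_first = len(set([generic, dna, protein]))
size_generic_last = len(set([dna, protein, generic]))
```

With the generic sequence added first the other two are treated
as duplicates of it; with it added last the DNA and protein
sequences are both kept. This is the kind of breakage the hash
rule is meant to prevent.
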
So the other option would be to base Seq equality
and hashing on the sequence string only (ignoring
the alphabet).

This would at least be a simple rule to remember (and
would mean we could implement less than, greater than
etc in the same way) but basically means we'd ignore
the alphabet.
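
A minimal sketch of this string based rule, assuming a
hypothetical class StrSeq and helper _text (neither is Biopython
code). Equality, ordering and the hash all use the letters only,
so a StrSeq also matches the equivalent plain string:

```python
from functools import total_ordering


def _text(obj):
    """Return the letters of a StrSeq or a plain string, else None."""
    if isinstance(obj, StrSeq):
        return obj.data
    if isinstance(obj, str):
        return obj
    return None


@total_ordering
class StrSeq(object):
    """Hypothetical sequence compared and hashed by its string only."""

    def __init__(self, data, alphabet=None):
        self.data = data
        self.alphabet = alphabet  # stored, but ignored in comparisons

    def __str__(self):
        return self.data

    def __eq__(self, other):
        other_text = _text(other)
        if other_text is None:
            return NotImplemented
        return self.data == other_text

    def __lt__(self, other):
        other_text = _text(other)
        if other_text is None:
            return NotImplemented
        return self.data < other_text

    def __hash__(self):
        # Same hash as the plain string, consistent with __eq__ above.
        return hash(self.data)


dna = StrSeq("ACGT", "dna")
protein = StrSeq("ACGT", "protein")
table = {dna: "stored under the DNA sequence"}
```

Here dna == protein and dna == "ACGT" both hold, and table can
be queried with protein or with the string "ACGT". Restricting
comparisons to StrSeq and str is a design choice of this sketch;
comparing against str(other) for arbitrary objects would make
StrSeq("1") equal to the integer 1 without sharing its hash.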

So, currently in Biopython, we have object identity.
We could have string based equality. I've thought about
other options but haven't come up with anything that
would be self consistent (and could be hashed).
If anyone has an alternative idea, please speak up.

I don't know what thought process Jose went through,
but he wants to use the same equality test in his code:
http://lists.open-bio.org/pipermail/biopython/2009-November/005861.html

Changing Seq equality like this would make Biopython
much nicer to use for basic tasks. For example, my
code (and the unit tests) often contain things like if
str(seq1)==str(seq2).

If we want to make this change, it is quite a break to
backwards compatibility. (It also has the downside that
a DNA sequence ACGT and a protein sequence ACGT
would evaluate as equal - probably not a big issue in
practice but counter intuitive).

One way to handle this would be to start by adding
explicit Seq __eq__ methods etc which preserve the
current behaviour (i.e. act like id(seq1)==id(seq2)
based on object identity) but issue a deprecation
warning. Then for a series of releases people would
be encouraged to use str(seq1)==str(seq2) or
id(seq1)==id(seq2) as appropriate. Then, after this
transition period, we would change the __eq__
methods to adopt the new behaviour.
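
As a rough sketch of the transition period, assuming a
hypothetical class TransitionSeq (the class name and the warning
text are illustrative, not proposed wording): the explicit __eq__
keeps the identity based result but warns, and the identity based
hash is kept alongside it.

```python
import warnings


class TransitionSeq(object):
    """Hypothetical Seq-like class for the deprecation period."""

    def __init__(self, data):
        self.data = data

    def __str__(self):
        return self.data

    def __eq__(self, other):
        warnings.warn(
            "Equality currently tests object identity and may change to "
            "string comparison. Use str(seq1)==str(seq2) or "
            "id(seq1)==id(seq2) explicitly.",
            DeprecationWarning, stacklevel=2)
        return self is other  # current behaviour preserved

    # Defining __eq__ removes the inherited hash, so restore it explicitly.
    __hash__ = object.__hash__
```

Dictionary lookups with the very same object should not trigger
the warning, since the identity check succeeds before __eq__ is
consulted.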

Or, we could have a Bio.Seq module level switch
to control the behaviour - initially defaulting to
the current system with a deprecation warning?
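
A sketch of how such a switch might look. All of the names here
(SwitchSeq, _string_based_equality, set_string_based_equality)
are assumptions for illustration, not an existing Bio.Seq API:

```python
import warnings

# Hypothetical module level switch, defaulting to the current behaviour.
_string_based_equality = False


def set_string_based_equality(flag):
    """Turn the hypothetical string based equality on or off."""
    global _string_based_equality
    _string_based_equality = bool(flag)


class SwitchSeq(object):
    """Hypothetical Seq-like class whose equality follows the switch."""

    def __init__(self, data):
        self.data = data

    def __eq__(self, other):
        if _string_based_equality:
            if not isinstance(other, SwitchSeq):
                return NotImplemented
            return self.data == other.data
        warnings.warn(
            "Equality currently tests object identity; this will change.",
            DeprecationWarning, stacklevel=2)
        return self is other

    def __hash__(self):
        if _string_based_equality:
            return hash(self.data)
        return id(self)
```

One caveat with this approach: flipping the switch changes the
hash of existing objects, so any dictionaries or sets built
beforehand could no longer be trusted.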

Peter

P.S. As a related point, we will need to switch the
MutableSeq from using __cmp__ to __eq__ etc
for future Python compatibility.