[Biojava-l] equals() method for SymbolList

Phillip Lord p.lord@russet.org.uk
11 Oct 2002 17:00:56 +0100


>>>>> "Keith" == Keith James <kdj@sanger.ac.uk> writes:

>>>>> "Phillip" == Phillip Lord <p.lord@russet.org.uk> writes:

>>>>> "Matthew" == Matthew Pocock <matthew_pocock@yahoo.co.uk> writes:

  Matthew> SymbolList should be behaving like a string over its
  Matthew> symbols. It is silly if it doesn't do this. Hash codes
  Matthew> should really be calculated in a different (but
  Matthew> sequence-dependent) way to avoid scanning the whole of very
  Matthew> large sequences just to do a hash lookup. Anyone got any
  Matthew> ideas?

  Phillip> Just make the hash out of say the first 10 elements in the
  Phillip> list. The hashcode is not meant to be unique for all
  Phillip> sequences, it's just a performance enhancement. So long as
  Phillip> equals returns false for different sequences, then there is
  Phillip> no problem.

  Keith> In a similar vein, the array sampling techniques at

  Keith> http://www273.pair.com/med/columns/Durable6.html

  Keith> would work, but equals would get called more often for
  Keith> sequences with similar base composition. How about first 10
  Keith> and then add in values for just the indices that are powers
  Keith> of two?

It would probably be a good idea to factor in the size of the Alphabet
as well. If there are only a few symbols, you have a much higher chance
of a collision, because each element can take only a few distinct values.
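
To make the sampling idea concrete, here is a rough sketch in Java. It is
not BioJava code: a List<Character> stands in for SymbolList, and the
class name SampledHash, the method hash() and the alphabetSize parameter
are all made up for illustration. It mixes in the alphabet size, the
sequence length (my own addition), the first 10 elements, and then the
elements at power-of-two indices beyond that prefix.

```java
import java.util.List;

public final class SampledHash {
    // "First 10 elements", as suggested in the thread.
    private static final int PREFIX = 10;

    private SampledHash() {}

    /**
     * Hypothetical sampled hash: looks at a bounded prefix plus the
     * 0-based indices 16, 32, 64, ... so the cost grows with log(n),
     * not n. List<Character> is a stand-in for a real SymbolList.
     */
    public static int hash(List<Character> symbols, int alphabetSize) {
        int n = symbols.size();
        int h = 17;
        h = 31 * h + alphabetSize; // factor in the size of the alphabet
        h = 31 * h + n;            // assumption: length is cheap to obtain
        for (int i = 0; i < Math.min(PREFIX, n); i++) {
            h = 31 * h + symbols.get(i).hashCode();
        }
        // Powers of two past the prefix; "i > 0" stops on int overflow.
        for (int i = 16; i > 0 && i < n; i <<= 1) {
            h = 31 * h + symbols.get(i).hashCode();
        }
        return h;
    }
}
```

Two sequences that differ only at an unsampled position (index 12, say)
get the same hash code, which is fine as long as equals() still compares
every element and returns false for them.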

You will still get problems, though, if the underlying sequence changes
while you are using it as a hash key.
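
The mutable-key problem is easy to reproduce with the standard
collections. The sketch below uses an ArrayList<Character> as a stand-in
for a mutable sequence (again, not BioJava code; MutableKeyDemo and
lookup() are names invented for this example). ArrayList computes its
hash code from its contents, so changing an element after insertion
leaves the entry filed under a stale hash.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public final class MutableKeyDemo {
    private MutableKeyDemo() {}

    /**
     * Adds a sequence to a HashSet, optionally mutates it afterwards,
     * and reports whether the set can still find it.
     */
    public static boolean lookup(boolean mutateAfterInsert) {
        List<Character> seq =
            new ArrayList<>(Arrays.asList('A', 'C', 'G', 'T'));
        Set<List<Character>> seen = new HashSet<>();
        seen.add(seq);                 // stored under the current hash
        if (mutateAfterInsert) {
            seq.set(0, 'T');           // hash code of seq changes here
        }
        return seen.contains(seq);     // looks up under the new hash
    }

    public static void main(String[] args) {
        System.out.println("unchanged key found: " + lookup(false));
        System.out.println("mutated key found:   " + lookup(true));
    }
}
```

With the usual HashSet implementation the second lookup fails even
though the very same object is still in the set. The common ways around
this are to use only immutable sequences (or defensive copies) as keys,
or to remove and re-insert the key around every change.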

Right, I really am going back to lurking now. 

Phil