[Biojava-l] equals() method for SymbolList
Phillip Lord
p.lord@russet.org.uk
11 Oct 2002 17:00:56 +0100
>>>>> "Keith" == Keith James <kdj@sanger.ac.uk> writes:
>>>>> "Phillip" == Phillip Lord <p.lord@russet.org.uk> writes:
>>>>> "Matthew" == Matthew Pocock <matthew_pocock@yahoo.co.uk> writes:
Matthew> SymbolList should be behaving like a string over its
Matthew> symbols. It is silly if it doesn't do this. Hash codes
Matthew> should really be calculated in a different (but
Matthew> sequence-dependent) way to avoid scanning the whole of very
Matthew> large sequences just to do a hash lookup. Anyone got any
Matthew> ideas?
Phillip> Just make the hash out of say the first 10 elements in the
Phillip> list. The hashcode is not meant to be unique for all
Phillip> sequences, it's just a performance enhancement. So long as
Phillip> equals returns false for different sequences, then there is
Phillip> no problem.
Keith> in a similar vein, the array sampling techniques at
Keith> http://www273.pair.com/med/columns/Durable6.html
Keith> would work, but equals would get called more often for
Keith> sequences with similar base composition. How about first 10
Keith> and then add in values for just the indices that are powers
Keith> of two?
It would probably be a good idea to factor in the size of the Alphabet
as well. If there are only a few symbols you get a much higher chance
of a collision, because there are only a few unique values for the
elements.
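For what it's worth, here is a rough sketch of how the pieces suggested so far might fit together: the first 10 symbols, then the positions that are powers of two, with the sequence length and alphabet size mixed in. It is not BioJava code. The int[] of symbol indices, the alphabetSize parameter and the names SampledHash and sampledHash are all stand-ins made up for illustration, and mixing the alphabet size in as one extra term is only one reading of "factor in".

```java
// Hypothetical sketch: a sequence is modelled as an int[] of symbol
// indices; this is an assumption, not the SymbolList API.
final class SampledHash {

    // Hash by sampling: sequence length, alphabet size, the first 10
    // symbols, then the symbols at 1-based positions 16, 32, 64, ...
    // (positions 1, 2, 4 and 8 are already covered by the first 10).
    static int sampledHash(int[] symbols, int alphabetSize) {
        int h = 17;
        h = 31 * h + symbols.length;
        h = 31 * h + alphabetSize;

        int head = Math.min(10, symbols.length);
        for (int i = 0; i < head; i++) {
            h = 31 * h + symbols[i];
        }

        // A long counter avoids overflow when doubling near the int limit.
        for (long pos = 16; pos <= symbols.length; pos *= 2) {
            h = 31 * h + symbols[(int) (pos - 1)];
        }
        return h;
    }
}
```

This reads roughly 10 + log2(length) symbols whatever the sequence size. Two sequences that differ only at unsampled positions get the same hash, so equals still has to compare them in full.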
You will still get problems, though, if the underlying sequence changes
while you are using it as a hash key.
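To illustrate the mutable-key problem with plain java.util classes rather than anything from BioJava: an ArrayList<Integer> stands in for the sequence here (an assumption made only for the demo, as are the names MutableKeyDemo and foundAfterMutation). Its hashCode depends on its contents, just as a content-based SymbolList hash would.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical demo: an ArrayList stands in for a mutable sequence.
final class MutableKeyDemo {

    // Stores a list as a map key, mutates it, then looks it up again.
    static boolean foundAfterMutation() {
        List<Integer> seq = new ArrayList<>(Arrays.asList(0, 1, 2, 3));
        Map<List<Integer>, String> map = new HashMap<>();
        map.put(seq, "my sequence");

        // Changing the key changes its hashCode, but the map still
        // files the entry under the hash computed at put() time.
        seq.set(0, 3);

        // Same object, yet the lookup uses the new hash and misses.
        return map.containsKey(seq);
    }

    public static void main(String[] args) {
        System.out.println("found after mutation: " + foundAfterMutation());
    }
}
```

The entry is still in the map, but it can no longer be reached through the key. The usual ways around this are to treat keys as immutable or to remove and re-insert the key around any edit.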
Right, I really am going back to lurking now.
Phil