[Biojava-l] equality of proteins based on their aminoacid sequence signature

Andy Yates ayates at ebi.ac.uk
Fri Mar 11 22:48:34 UTC 2011


Hi Francois,

So I've been thinking about this, and if we add equality support to a small set of objects (compounds and compound sets) we can get sequence equality working. This will be done as part of the SequenceMixin class, and we can provide case-sensitive and case-insensitive versions. We can also use the length and the compound sets to reject a pair of sequences without needing to iterate through them. The code will look like:

SequenceMixin.sequenceEquality(dnaOne, dnaTwo);

or

SequenceMixin.sequenceEqualityIgnoreCase(dnaOne, dnaTwo);
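
To make the quick-rejection idea concrete, here is a rough sketch. It is not the BioJava implementation: the SequenceEqualitySketch class and its Seq model (a compound set name plus a list of compound short names) are hypothetical stand-ins so the example runs on its own.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the proposed SequenceMixin methods; not BioJava code.
final class SequenceEqualitySketch {

    // Minimal sequence model (assumption): a compound set name plus compound short names.
    static final class Seq {
        final String compoundSet;
        final List<String> compounds;

        Seq(String compoundSet, String... compounds) {
            this.compoundSet = compoundSet;
            this.compounds = Arrays.asList(compounds);
        }
    }

    static boolean sequenceEquality(Seq one, Seq two) {
        return compare(one, two, false);
    }

    static boolean sequenceEqualityIgnoreCase(Seq one, Seq two) {
        return compare(one, two, true);
    }

    private static boolean compare(Seq one, Seq two, boolean ignoreCase) {
        // Cheap rejections first: a different compound set or a different
        // length means the sequences cannot be equal, so skip the iteration.
        if (!one.compoundSet.equals(two.compoundSet)) {
            return false;
        }
        if (one.compounds.size() != two.compounds.size()) {
            return false;
        }
        // Only now compare compound by compound, stopping at the first mismatch.
        for (int i = 0; i < one.compounds.size(); i++) {
            String a = one.compounds.get(i);
            String b = two.compounds.get(i);
            boolean same = ignoreCase ? a.equalsIgnoreCase(b) : a.equals(b);
            if (!same) {
                return false;
            }
        }
        return true;
    }
}
```

The two cheap checks cost the same regardless of sequence length, so pairs that differ in length or compound set never reach the loop.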

Don't forget you can also use checksums like MD5 and SHA-1 to calculate a value which is very unlikely to clash (projects like InterPro use this technique to cache results behind a very quick lookup). You can do this like:

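// Needs java.security.MessageDigest, java.math.BigInteger and
// java.nio.charset.StandardCharsets; getInstance() throws NoSuchAlgorithmException.
MessageDigest m = MessageDigest.getInstance("MD5");
for (Compound c : seq) {
  // Use a fixed charset so the checksum does not depend on the platform default
  m.update(c.getShortName().getBytes(StandardCharsets.UTF_8));
}
// Render the 128-bit digest as 32 zero-padded, upper-case hex digits
BigInteger i = new BigInteger(1, m.digest());
String md5checksum = String.format("%1$032X", i);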
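
As a rough sketch of the caching use mentioned above, a checksum can serve as the key of a lookup table. This is only an illustration under simplifying assumptions: ChecksumCacheSketch is a hypothetical class, sequences are plain Strings rather than BioJava objects, and the cached "result" is just a String.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper; sequences are plain strings here, not BioJava Sequence objects.
final class ChecksumCacheSketch {

    // Cached results keyed by the MD5 checksum of the sequence.
    private final Map<String, String> cache = new HashMap<String, String>();

    // MD5 of the sequence as 32 zero-padded, upper-case hex digits.
    static String md5(String sequence) {
        try {
            MessageDigest m = MessageDigest.getInstance("MD5");
            m.update(sequence.getBytes(StandardCharsets.UTF_8));
            return String.format("%032X", new BigInteger(1, m.digest()));
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide MD5, so this is unexpected.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    void store(String sequence, String result) {
        cache.put(md5(sequence), result);
    }

    // Returns the cached result, or null if this sequence has not been stored.
    String lookup(String sequence) {
        return cache.get(md5(sequence));
    }
}
```

The checksum is case sensitive, so "ACGT" and "acgt" produce different keys; normalise the case first if the two should hit the same cache entry.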

HTH

Andy

On 10 Mar 2011, at 12:47, Andy Yates wrote:

> This is where the subject becomes murky, and it probably means that any code written for equals() and hashCode() will have to take features into account where present. However, sequence compound identity would still be available from another method, but this will require an extension of the Sequence interface.
> 
> Andy
> 
> On 10 Mar 2011, at 12:22, Francois Le Fevre wrote:
> 
>> This could be great. But for me equals means only sequence identity and not features.
>> 
>> 
>>> On 10 March 2011 10:17, "Andy Yates" <ayates at ebi.ac.uk> wrote:
>>> 
>>> I cannot remember the reason why we decided to not include equality for these objects. It's not an unreasonable thing to want though. Assuming I have some time soon I can have a look into implementing it on AbstractCompound, AbstractSequence & the backing stores but it will be some time away. If anyone else wants to give it a shot ... :)
>>> 
>>> Andy
>>> 
>>> On 10 Mar 2011, at 01:04, Andreas Prlic wrote:
>>> 
>>>> Hi François,
>>>> 
>>>> you could try to compare the st...
>>> 
>>> --
>>> Andrew Yates Ensembl Genomes Engineer
>>> EMBL-EBI Tel: +44-(0)1...
>>> 
>> 
> 
> -- 
> Andrew Yates                   Ensembl Genomes Engineer
> EMBL-EBI                       Tel: +44-(0)1223-492538
> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/