[Bioperl-l] Re: ComparableI stuff

Mon Apr 19 06:10:38 EDT 2004

Hilmar Lapp wrote:

> I was going to write a more detailed response but probably won't be 
> getting to it before Monday due to painful deadlines. Generally, I 
> have a number of issues with this.
>
>     - On a very general level, basing equality on equal hash keys is 
> dangerous because it violates the standard definition of a hash key. 
> You construct it as a string most of the time and hence comparing keys 
> is meant as a short-cut for comparing objects, but I really would not 
> call it hash keys and at the same time assume equal objects iff equal 
> hash keys.
>
>     - Defining object equality for complex objects is more a matter of 
> subjective judgment rather than having objective criteria that we can 
> define and impose on everybody. In fact, I think doing so is 
> dangerous, because it creates the false impression of obviating 
> people's need to make their own appropriate decision on when to call 
> objects equal and when not to.
>
> As an example, your hashkey() on SeqFeatureI uses the positional 
> information but leaves out the sequence on which it sits, leaves out 
> the source_tag(), and in fact leaves out the entire tag system. It 
> also leaves out a feature's annotation. This definition of equality 
> may be fine in some cases, but may also be completely inappropriate in 
> others.
>
Ok, valid criticism. But I just noticed that the hashkey() 
documentation is wrong. I shouldn't say that hashkey() can stand in for 
equality, it can't. That's the job of the diff() method. The hashkey() 
method is just used to order SeqFeatureI objects so that the comparison 
of feature lists can be done. Probably a bad name for the method (a 
result of development history).  
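
To make the split concrete, here is a minimal sketch in Python rather 
than Perl. The names Feature, sort_key, diff and diff_feature_lists are 
hypothetical and are not the Bioperl methods; the attributes compared 
are just an assumption for illustration. The key only puts two feature 
lists into the same order; the actual comparison is done separately.

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    # Hypothetical stand-in for a sequence feature, not a Bioperl class.
    primary_tag: str
    start: int
    end: int
    strand: int = 1
    tags: dict = field(default_factory=dict)


def sort_key(feature):
    # Used for ordering only. Two features with the same key
    # are NOT assumed to be equal.
    return (feature.start, feature.end, feature.strand, feature.primary_tag)


def diff(a, b):
    # Return the names of the attributes on which a and b differ.
    names = ("primary_tag", "start", "end", "strand", "tags")
    return [n for n in names if getattr(a, n) != getattr(b, n)]


def diff_feature_lists(xs, ys):
    # Sort both lists with the same key so that features pair up,
    # then compare each pair with diff().
    xs = sorted(xs, key=sort_key)
    ys = sorted(ys, key=sort_key)
    report = []
    if len(xs) != len(ys):
        report.append(f"feature count: {len(xs)} != {len(ys)}")
    for i, (a, b) in enumerate(zip(xs, ys)):
        changed = diff(a, b)
        if changed:
            report.append(f"feature {i}: {', '.join(changed)}")
    return report
```

Two features that share a position but carry different tags get the 
same sort key here and still show up as different in the report, which 
is the behaviour the documentation should have described.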

> In biosql, as an example, it is completely inappropriate; biosql 
> defines two SeqFeature entries as equal that are on the same sequence, 
> have the same primary_tag, same source_tag, and are in the same 
> position in the sequence's feature array. The features' display_name 
> is irrelevant, as is the positional information. If I compared two 
> seqfeatures using the ComparableI interface, they may compare as 
> unequal and yet if I store them I'd get thrown out with a unique key 
> failure. Chado I believe has a slightly different definition of the 
> unique key and may take the positional information into account.
>
> As another example, for genbank/embl features it is also inappropriate 
> because it doesn't test for equality of attached annotations. Note 
> that you may define equality of two feature table entries in a genbank 
> record by them containing the same annotation regardless of the 
> annotations' order of appearance. I.e., comparing the annotation 
> arrays element by element would be too strict then.
>
Wait a sec... I compare annotations at the sequence level by comparing 
the (sorted) annotation lists. The process of sorting is meant to deal 
with the problem of order.
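
As a rough illustration of the idea (Python, with annotations modelled 
as plain (key, value) string pairs, which is an assumption and not how 
Bioperl stores them):

```python
def same_annotations(left, right):
    # Compare two annotation lists while ignoring order of appearance.
    # Sorting first means only the content matters, including duplicates.
    return sorted(left) == sorted(right)
```

For example, [("keyword", "kinase"), ("comment", "draft")] and the same 
two pairs in the opposite order compare as equal, while a list with an 
extra or missing pair does not.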

> My point here is not that I would urge you to add all these properties 
> to the SeqFeatureI->diff implementation, because there may be use 
> cases for which your current implementation is perfectly fine. My 
> point is rather, I don't see the value of having one definition of 
> equality implemented in a way that doesn't allow others to coexist 
> when the one that is implemented is going to serve only a third of all 
> use cases.
>
> I guess one of my key problems is that in fact I don't understand what 
> the exact use case is. Apart from that, the implementation doesn't 
> seem to allow for multiple use cases, which will unequivocally result 
> in different implementations of equality having to peacefully co-exist.
>
> What I could rather envision as being useful is a design along 
> 'schemes', where you can swap in one 'equality scheme' for another 
> depending on what your needs are, and in which somebody who has a need 
> for a yet-unimplemented definition could add that implementation and 
> then swap it in.
>
Firstly, I think the question of the use case is the hub of our 
misunderstanding.

The reason I wrote this stuff is that I want a way to check that what 
SeqIO reads off disk is complete and correct. My understanding is that a 
sequence file in a particular format unambiguously defines what a 
sequence (with its associated annotations) is. So two formats with the 
same power (e.g. genbank and embl) describing the same entry should turn 
into the same sequence object. Likewise, if a sequence is read from disk 
and then written to disk again, the resulting sequence file should parse 
into the same sequence information. (At the moment this is not the case 
for Bioperl.)
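
The read/write/read check can be sketched like this. It is Python with 
a made-up "key=value" toy format; parse_record, write_record and 
round_trips are hypothetical helpers standing in for a real parser and 
writer, not SeqIO calls.

```python
def parse_record(text):
    # Parse "key=value" lines into a dict of key -> list of values.
    record = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        key, _, value = line.partition("=")
        record.setdefault(key.strip(), []).append(value.strip())
    return record


def write_record(record):
    # Write the record back out in a normalised layout (sorted keys).
    lines = []
    for key in sorted(record):
        for value in record[key]:
            lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"


def round_trips(text):
    # The file on disk may change, but the parsed information must not.
    first = parse_record(text)
    second = parse_record(write_record(first))
    return first == second
```

The check deliberately compares parsed objects rather than file 
contents, so a writer is free to change whitespace or ordering as long 
as nothing is lost.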

This is all useful to me because I'm trying to write a set of tests to 
prove: does Bioperl do what it claims that it does? The current SeqIO 
tests focus on comparing values parsed from a given file against a set 
of 'known good values'. I feel that is a dangerous way to test because 
it focusses on a small number of test cases as opposed to a large range 
of possible real world data. So I decided to try and set a standard for 
how sequence parsing should behave, and that standard in my mind is that 
for a given format, a given set of attributes should be parseable and 
writeable in a predictable way. Ultimately this is all towards 
developing some way of 'validating' Bioperl, i.e. giving a kind of 
guarantee that where the documentation says it does this, it actually 
does do it. To do that, I need a way to enumerate as much as possible 
the expected behaviour, to put some kind of bounds on it in a fairly 
general sense. Does this make sense?

Your idea about 'equality schemes' is interesting. So in this case you'd 
apply a scheme (a singleton object, I assume) to two sequences, and get 
a particular result? So the diff() methods I've written thus far would 
be within this class? It's another possible approach, I guess. I'd like 
to hear other people's ideas.
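
To check that I understand the proposal, here is one way it could look, 
sketched in Python. Everything here is hypothetical (Feature, 
EqualityScheme, the two scheme objects and the field lists); the 
"biosql-like" field list only follows the description quoted above and 
is not taken from the biosql schema itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    # Hypothetical stand-in for a sequence feature.
    seq_id: str
    primary_tag: str
    source_tag: str
    rank: int  # position in the sequence's feature array
    start: int
    end: int


class EqualityScheme:
    # One definition of equality: the list of attributes that must match.
    def __init__(self, name, fields):
        self.name = name
        self.fields = tuple(fields)

    def diff(self, a, b):
        # Names of the attributes, within this scheme, on which a and b differ.
        return [f for f in self.fields if getattr(a, f) != getattr(b, f)]

    def equal(self, a, b):
        return not self.diff(a, b)


# Two schemes that can coexist; callers pick the one that fits their use case.
POSITIONAL = EqualityScheme("positional", ["primary_tag", "start", "end"])
BIOSQL_LIKE = EqualityScheme(
    "biosql-like", ["seq_id", "primary_tag", "source_tag", "rank"]
)
```

With this layout, two features that differ only in their coordinates 
are equal under BIOSQL_LIKE and unequal under POSITIONAL, and a new 
definition is added by creating another scheme object rather than by 
changing the feature class.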

Peter