[Biojava-dev] Problem with ranks

mark.schreiber at novartis.com mark.schreiber at novartis.com
Tue Sep 12 03:37:55 UTC 2006


Hi George, thanks for raising these issues. We should fix this before
biojava 1.5 finishes its beta testing. See my responses below. Richard
Holland and David Scott will no doubt have comments too.

>I am having difficulty using ranking with some objects found in
>SimpleRichSequence. There are 6 objects contained in SimpleRichSequence
>which are found within collections, namely SimpleComment,
>SimpleRankedCrossRef, SimpleRankedDocRef, SimpleNote,
>SimpleBioEntryRelationship, and SimpleRichFeature. Each of them is
>associated with a TreeSet and uses ranking to some extent for comparison.
>
>Ranks are never described but the name suggests that they are positive
>integers, in consecutive order and not identical for similar objects
>within the same sequence. Here are some questions:

Ranks actually come from the BioSQL schema. They are used so that lists of 
features, comments etc that are stored in database tables (or any other 
collection) can be reassembled in the same order that they are found in 
the original flatfile (Genbank etc). Simply put they are used to preserve 
order.

> - Can rank be negative? We would assume not but this is never checked.

I suppose it could be but it would make no sense given the above 
description. We should probably document this in the javadocs and suggest 
that classes enforce the non-negative rule.
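
As a rough sketch of what enforcing the non-negative rule could look like
(RankedItem is a made-up class for illustration, not part of BioJava, and
throwing IllegalArgumentException is only one possible policy):

```java
// Hypothetical class, not part of BioJava: one way to reject negative ranks.
final class RankedItem {
    private int rank;

    RankedItem(int rank) {
        setRank(rank);
    }

    int getRank() {
        return rank;
    }

    // Throws IllegalArgumentException if the rank is negative.
    void setRank(int rank) {
        if (rank < 0) {
            throw new IllegalArgumentException(
                    "rank must be non-negative, got " + rank);
        }
        this.rank = rank;
    }
}
```

Whether 0 itself is a legal rank is a separate question (see the next
point), so this sketch only rules out values below zero.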

>- If rank cannot be negative, where do they start, 0, 1?
>SimpleBioEntryRelationship suggests that they start at 1 with 0 reserved
>for absence of ranking.

At the moment this strictly depends on the creating object. Typically this 
would be a RichSequenceFormat implementation. The Genbank format appears 
to start numbering from either 0 or 1 (for comments). There should be a 
common rule.

>- Are we expecting ranks to be in consecutive order (or in reasonably
>consecutive order), or are values like 1000, 2000, etc. possible or even
>expected?

Is there any reason why we need to enforce this rule? It would be tidier 
but it would be a pain to have to re-order everything just because one 
object is deleted. The genbank parser currently numbers sequentially.
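
To illustrate why gaps are harmless if ranks only exist to preserve order,
here is a small sketch. RankedComment and RankOrderDemo are hypothetical
names, not BioJava classes; the point is only that sorting on rank restores
the original order whether or not the numbers are consecutive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for a ranked comment; not the BioJava SimpleComment.
final class RankedComment {
    final int rank;
    final String text;

    RankedComment(int rank, String text) {
        this.rank = rank;
        this.text = text;
    }
}

final class RankOrderDemo {
    // Returns the comment texts in ascending rank order.
    static List<String> inRankOrder(List<RankedComment> comments) {
        List<RankedComment> sorted = new ArrayList<>(comments);
        sorted.sort(Comparator.comparingInt((RankedComment c) -> c.rank));
        List<String> texts = new ArrayList<>();
        for (RankedComment c : sorted) {
            texts.add(c.text);
        }
        return texts;
    }

    public static void main(String[] args) {
        // Rows as they might come back from a database, in arbitrary order.
        List<RankedComment> rows = new ArrayList<>(Arrays.asList(
                new RankedComment(3, "third"),
                new RankedComment(1, "first"),
                new RankedComment(2, "second")));
        System.out.println(inRankOrder(rows)); // [first, second, third]

        // Deleting the middle comment leaves ranks 1 and 3; the order of
        // the remaining comments is unchanged and nothing is renumbered.
        rows.removeIf(c -> c.rank == 2);
        System.out.println(inRankOrder(rows)); // [first, third]
    }
}
```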

>- Can we have duplicate ranks? We would assume not but SimpleRichFeature
>javadoc indicates that equal ranks are *acceptable*.

Certainly all the RankedCrossRefs returned by the Genbank parser have the
same rank (0). Duplicate ranks are possible as long as the objects are
otherwise unique. If equals() is true then the two objects are treated as
duplicates and only one of them is kept. I don't think any Ranked object
currently relies only on rank for equality (or for the compareTo() method
either). The unit tests do a pretty good job of testing equals and
compareTo and making sure they return logically equivalent values.
Although duplicate ranks are possible they may not be desirable. Any
thoughts?
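
A sketch of how duplicate ranks behave in a java.util.TreeSet, assuming a
compareTo that uses rank first and then some other distinguishing field.
CrossRef and its accession field are invented for illustration and do not
mirror SimpleRankedCrossRef. Note that TreeSet.add() keeps the element
already in the set and returns false when the new one compares equal.

```java
import java.util.TreeSet;

// Hypothetical ranked cross-reference: compares by rank, then by accession.
final class CrossRef implements Comparable<CrossRef> {
    final int rank;
    final String accession;

    CrossRef(int rank, String accession) {
        this.rank = rank;
        this.accession = accession;
    }

    @Override
    public int compareTo(CrossRef other) {
        int byRank = Integer.compare(rank, other.rank);
        return byRank != 0 ? byRank : accession.compareTo(other.accession);
    }

    // equals() and hashCode() are kept consistent with compareTo().
    @Override
    public boolean equals(Object o) {
        return o instanceof CrossRef && compareTo((CrossRef) o) == 0;
    }

    @Override
    public int hashCode() {
        return 31 * rank + accession.hashCode();
    }
}

final class DuplicateRankDemo {
    public static void main(String[] args) {
        TreeSet<CrossRef> refs = new TreeSet<>();
        // Same rank, different accession: both are stored.
        System.out.println(refs.add(new CrossRef(0, "GO:0001"))); // true
        System.out.println(refs.add(new CrossRef(0, "GO:0002"))); // true
        // Same rank and accession: compares equal, so the set is unchanged.
        System.out.println(refs.add(new CrossRef(0, "GO:0001"))); // false
        System.out.println(refs.size()); // 2
    }
}
```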

>SimpleBioEntryRelationship getRank method returns an Integer object, all
>the other objects return a primitive int. Any reason for this?

I think Richard has a reason. Something to do with Hibernate?? Richard??

>Moreover 3 of these objects do not have a setRank method: SimpleComment,
>SimpleRankedCrossRef and SimpleRankedDocRef. How do I insert a comment in
>the middle of other comments, how do I change the order of these objects
>without creating new ones?

Possibly they should. Making things mutable is always tricky but the other 
objects with setRank methods register change listeners and have the option 
of vetoing the change so it can be done safely. The ChangeListener could 
be in charge of re-ordering ranks if you insert into the middle.
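
The re-ordering step itself could be as simple as the sketch below. It
leaves out the change listener and veto machinery entirely and just shows
the renumbering; RankedText and RankInsertDemo are hypothetical names, and
starting the numbering at 1 is an assumption, not an agreed rule.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical mutable ranked object; not a BioJava class.
final class RankedText {
    private int rank;
    final String text;

    RankedText(int rank, String text) {
        this.rank = rank;
        this.text = text;
    }

    int getRank() {
        return rank;
    }

    void setRank(int rank) {
        this.rank = rank;
    }
}

final class RankInsertDemo {
    // Inserts item at position index of a rank-ordered list, then renumbers
    // every element so that ranks run 1, 2, 3, ... in list order.
    static void insertAt(List<RankedText> ordered, int index, RankedText item) {
        ordered.add(index, item);
        for (int i = 0; i < ordered.size(); i++) {
            ordered.get(i).setRank(i + 1);
        }
    }

    public static void main(String[] args) {
        List<RankedText> comments = new ArrayList<>();
        comments.add(new RankedText(1, "first"));
        comments.add(new RankedText(2, "third"));
        insertAt(comments, 1, new RankedText(0, "second"));
        for (RankedText c : comments) {
            System.out.println(c.getRank() + " " + c.text);
        }
    }
}
```

If the objects live in a TreeSet, they would need to be removed and
re-added around the rank change, since a sorted set does not notice when
the sort key of an element it already holds is mutated.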

>All these objects have an ordering consistent with equality except
>SimpleRichFeature. SimpleRichFeature objects are sorted by rank only. Its
>compareTo method also never returns 0. A consequence is that removeFeature
>in ThinRichSequence never works because TreeSet uses compareTo for
>testing equality.

OK, that sounds like a bug that we have missed in the Unit tests. I will 
report it to bugzilla and fix it when I have time.
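
For the record, the failure mode is easy to reproduce outside BioJava. The
sketch below uses two made-up classes (BrokenFeature and FixedFeature,
neither taken from the BioJava source) to show that TreeSet.remove() cannot
find an element whose compareTo never returns 0.

```java
import java.util.TreeSet;

// Hypothetical feature whose compareTo never returns 0.
final class BrokenFeature implements Comparable<BrokenFeature> {
    final int rank;

    BrokenFeature(int rank) {
        this.rank = rank;
    }

    @Override
    public int compareTo(BrokenFeature other) {
        // Ties are reported as "greater", even against the same object.
        return rank < other.rank ? -1 : 1;
    }
}

// Same idea, but compareTo returns 0 when rank and name both match.
final class FixedFeature implements Comparable<FixedFeature> {
    final int rank;
    final String name;

    FixedFeature(int rank, String name) {
        this.rank = rank;
        this.name = name;
    }

    @Override
    public int compareTo(FixedFeature other) {
        int byRank = Integer.compare(rank, other.rank);
        return byRank != 0 ? byRank : name.compareTo(other.name);
    }
}

final class RemoveDemo {
    // Adds one element and tries to remove the very same object again.
    static boolean brokenRemoveWorks() {
        TreeSet<BrokenFeature> set = new TreeSet<>();
        BrokenFeature f = new BrokenFeature(1);
        set.add(f);
        return set.remove(f);
    }

    static boolean fixedRemoveWorks() {
        TreeSet<FixedFeature> set = new TreeSet<>();
        FixedFeature f = new FixedFeature(1, "gene");
        set.add(f);
        return set.remove(f);
    }

    public static void main(String[] args) {
        System.out.println(brokenRemoveWorks()); // false
        System.out.println(fixedRemoveWorks());  // true
    }
}
```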

>All compareTo methods use rank first except SimpleRankedDocRef which does
>not use rank at all (but is ranked as its name indicates).

We should change this. Another bugzilla report.

>A few objects are nearly identical when they are equal but not all.
>SimpleNote compares by rank then by term but not by value. SimpleNotes of
>same rank and term but different values are nevertheless equal.
>SimpleRankedDocRef can be equal and have different locations? I can
>understand this.

This is because the term of a SimpleNote is an ontology term and should 
therefore have only one value. Two Notes with the same term are therefore 
the same (or should be). For example if the term or keyword of the Note is 
Organism: there should only be one of these Notes.

>We need a clear definition of what ranks are, what the ordering they
>imply is intended for and how to deal with duplicate ranks. Maybe we
>could have an interface that encapsulates the concept of ranking (e.g.
>interface Ranked, methods setRank() and getRank()) and all this
>information grouped in the javadoc. It seems easier to derive exceptions
>from a common pattern than the opposite. Maybe we also need separate
>comparators when they are not consistent with equals.

I think we should have a 'Ranked' interface with clear rules in the 
javadoc. I can't think of any good reason why comparable and equal should 
not be consistent. We should try and keep them the same as much as 
possible.
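
A first sketch of what such an interface might look like. Nothing here
exists in BioJava yet: the Ranked name and its two methods come from
George's suggestion, while RankedLabel, RankedComparators and the
non-negative check are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical interface following the suggestion in this thread.
interface Ranked {
    // Returns the rank used to order this object among its siblings.
    int getRank();

    // Sets the rank; this sketch assumes ranks must be non-negative.
    void setRank(int rank);
}

// Minimal implementation, for illustration only.
final class RankedLabel implements Ranked {
    private int rank;
    final String label;

    RankedLabel(int rank, String label) {
        setRank(rank);
        this.label = label;
    }

    @Override
    public int getRank() {
        return rank;
    }

    @Override
    public void setRank(int rank) {
        if (rank < 0) {
            throw new IllegalArgumentException("rank must be non-negative");
        }
        this.rank = rank;
    }
}

final class RankedComparators {
    // Orders by rank only. Two different objects with the same rank compare
    // as 0 here, so this comparator is not consistent with equals and is
    // meant for sorting lists, not for backing a TreeSet.
    static final Comparator<Ranked> BY_RANK =
            Comparator.comparingInt(Ranked::getRank);

    public static void main(String[] args) {
        List<RankedLabel> labels = new ArrayList<>();
        labels.add(new RankedLabel(2, "b"));
        labels.add(new RankedLabel(1, "a"));
        labels.sort(BY_RANK);
        System.out.println(labels.get(0).label); // a
    }
}
```

Keeping the rank-only comparator separate from each class's own compareTo
would let compareTo stay consistent with equals while still giving callers
a way to sort purely by rank.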

- Mark





