[Biojava-l] LocationTools + Decorated Locations = ?

Cox, Greg gcox@netgenics.com
Thu, 7 Jun 2001 17:15:14 -0400


I'm feeling masochistic and ready to tackle the issue of how the location
semantics apply to decorated locations (currently circular and 'between'
locations).  

Between locations:
The problem:
	The standard BioJava location is defined to be a set of indices of
bases, which may or may not be contiguous.  This isn't sufficient for
locations like 5^6, which specifies the position between bases five and
six.  The analogy that helps me is that BioJava locations represent
integers, while 5^6 is a fraction in between.  
	To solve this, at the boot camp we decided to create a decorator
that would contain the 'betweenness' of this flavor of location.  This took
long enough that we didn't have the energy to hash out the semantics of how
the location operations apply to these.  In this e-mail I'm putting forward
a proposal.  No implementation has occurred, so feel free to tear it apart.
	The following operations need to be defined with respect to between
locations:
	areEqual
	union
	intersection
	overlaps
	contains

Proposed solution:
	This is a little sketchy and written toward between locations.  It
doesn't specify the behavior of every operation.

	areEqual(Location A, Location B): remove the wrappers and delegate
to the wrapped location.
	Argument:
		This isn't as mindless as it sounds.  A legitimate between
location is 5^10, which indicates, "Points to a site between two adjacent
bases anywhere between bases" 5 and 10.  Therefore, two locations which are
both 5^10 could point to different locations, and 5^10 and 7^9 could point
to the same location.  I fear that way lies madness.  
	
	union(Location A, Location B): If location A == location B, return
location A.  Otherwise, return a compound location consisting of A and B.
These locations are not compressed; thus union(1..10, 5^6) results in the
compound location [1..10, 5^6] instead of 1..10.  
	Argument:
		If the locations are compressed, the information that the
space between 5 and 6 is included in this location is lost.
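
A minimal sketch of the non-compressing union described above.  The Loc
class and the list-based compound result are hypothetical stand-ins for
illustration, not BioJava's actual Location classes.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in types for illustration; not BioJava's Location classes.
final class UnionSketch {

    // Either a range (min..max) or a between location (min^max).
    static final class Loc {
        final int min;
        final int max;
        final boolean between;

        Loc(int min, int max, boolean between) {
            this.min = min;
            this.max = max;
            this.between = between;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Loc)) return false;
            Loc other = (Loc) o;
            return min == other.min && max == other.max && between == other.between;
        }

        @Override
        public int hashCode() {
            return 31 * (31 * min + max) + (between ? 1 : 0);
        }

        @Override
        public String toString() {
            return min + (between ? "^" : "..") + max;
        }
    }

    // Equal arguments collapse to one member; otherwise both are kept as-is,
    // with no attempt to merge the between location into an enclosing range.
    static List<Loc> union(Loc a, Loc b) {
        if (a.equals(b)) {
            return Collections.singletonList(a);
        }
        return Arrays.asList(a, b);
    }

    public static void main(String[] args) {
        Loc range = new Loc(1, 10, false);
        Loc between = new Loc(5, 6, true);
        System.out.println(union(range, between));   // prints [1..10, 5^6]
        System.out.println(union(between, between)); // prints [5^6]
    }
}
```

Keeping both members costs a slightly larger result, but the fact that the
space between 5 and 6 is part of the location survives.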


	intersection(Location A, Location B): Return true iff Location B ==
Location A or a sub location of Location A (in the case of compound
locations)
	Argument:
		To me, this falls out of the areEqual semantics; since
between locations are defined to not have a range (even if they wrap a range
location) they can only intersect if they are equal.  
	
	overlaps(Location A, Location B): Same semantics as intersection.
	Argument:
		And for the same reason.  Between locations have no range;
therefore intersection is equivalent to overlapping.

	contains(Location A, Location B): Returns false.  
	Argument:
		Again, this comes from between locations having no range.
I'm willing to go with contains inheriting intersection's semantics as well
though, since point locations contain equal point locations.  I have only a
slight preference for returning false.
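
The areEqual, overlaps and contains rules above could be sketched roughly
as follows, for the case where B is a between location.  All type names are
hypothetical stand-ins rather than BioJava's classes, and the sketch assumes
that a plain range never equals a between location, which the proposal does
not spell out.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in types for illustration; not BioJava's Location classes.
final class BetweenSemanticsSketch {

    interface Loc {}

    static final class Range implements Loc {
        final int min;
        final int max;
        Range(int min, int max) { this.min = min; this.max = max; }
    }

    // Decorator marking the wrapped range as a between location (min^max).
    static final class Between implements Loc {
        final Range wrapped;
        Between(int min, int max) { this.wrapped = new Range(min, max); }
    }

    static final class Compound implements Loc {
        final List<Loc> parts;
        Compound(Loc... parts) { this.parts = Arrays.asList(parts); }
    }

    // Strip both wrappers and compare the wrapped ranges.
    // Assumption: a plain range never equals a between location.
    static boolean areEqual(Loc a, Between b) {
        if (!(a instanceof Between)) return false;
        Range x = ((Between) a).wrapped;
        return x.min == b.wrapped.min && x.max == b.wrapped.max;
    }

    // True iff b equals a, or equals one of a's sub-locations.
    static boolean overlaps(Loc a, Between b) {
        if (a instanceof Compound) {
            for (Loc part : ((Compound) a).parts) {
                if (areEqual(part, b)) return true;
            }
            return false;
        }
        return areEqual(a, b);
    }

    // The slightly preferred option: a between location has no range,
    // so nothing is reported as containing it.
    static boolean contains(Loc a, Between b) {
        return false;
    }
}
```

Under these rules 5^10 equals 5^10 but not 7^9, and 1..10 does not overlap
5^6 unless 5^6 is itself a member of the compound location.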

Circular Locations:
The problem:
	Circular locations essentially redefine the coordinate system.  The
semantics are perfectly clear when dealing with locations that are circular
with the same circular length.  However, it's not clear what the result
should be when a location on a circular sequence of length 50 is intersected
with a location on a circular sequence of length 40.  

Potential solution:
	At the boot camp, we decided that the "right" thing to do is
visualize any sequence as extending to infinite length (Matthew, Thomas,
correct me if I misrepresent you).  Then a non-circular sequence occupies
the first n spaces of infinity, and circular sequences of length k occupy
the entire infinite sequence repeating every k spaces.  An ASCII
illustration of circular sequence GAC, circular sequence ACGT and linear
sequence ACGTTAC:
123123123123...
GACGACGACGAC...

123412341234...
ACGTACGTACGT...

1234567
ACGTTAC
The two circular sequences together repeat with a period of 12, the LCM of
their lengths (3 and 4).  Any operation on them will therefore return a
value in a coordinate system whose length is the LCM of the lengths of the
arguments.  So, intersection(location on sequence 1, location on sequence 2)
would return a location on a hypothetical sequence of length 12.  This is
the mathematically correct way to handle circular regions.  Reflecting on
this approach though, I have the following observations:
1) It's a lot of work.
2) It will return a lot of funny locations.  For example, union(3..4 length
4, 3..1 length 3) yields the location join(1, 3..4, 6..12) length 12.
3) It's not clear there's a real payoff here.  Yes, it's mathematically
correct, but I'm not sure it will be used.
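
To make the LCM arithmetic concrete, here is a small sketch that unrolls
each circular location over one full period and unions the resulting
positions.  The class and method names are made up for illustration, and a
location with end < start is assumed to wrap around the origin.

```java
import java.util.TreeSet;

// Hypothetical helper for illustration; not part of BioJava.
final class CircularLcmSketch {

    static int gcd(int a, int b) { return b == 0 ? a : gcd(b, a % b); }

    static int lcm(int a, int b) { return a / gcd(a, b) * b; }

    // Expand start..end on a circle of the given length into every 1-based
    // position it covers within one period of the combined coordinate system.
    static TreeSet<Integer> unroll(int start, int end, int length, int period) {
        TreeSet<Integer> out = new TreeSet<>();
        // Number of bases covered; end < start means the location wraps.
        int span = (end >= start) ? end - start + 1 : length - start + 1 + end;
        for (int copy = 0; copy < period / length; copy++) {
            for (int i = 0; i < span; i++) {
                int pos = (start - 1 + i) % length + 1; // position on the circle
                out.add(copy * length + pos);
            }
        }
        return out;
    }

    static void appendRun(StringBuilder sb, int from, int to) {
        if (sb.charAt(sb.length() - 1) != '(') sb.append(", ");
        sb.append(from == to ? String.valueOf(from) : from + ".." + to);
    }

    // Union of two circular locations, written as join(...) over the LCM period.
    static String union(int s1, int e1, int len1, int s2, int e2, int len2) {
        int period = lcm(len1, len2);
        TreeSet<Integer> all = unroll(s1, e1, len1, period);
        all.addAll(unroll(s2, e2, len2, period));

        StringBuilder sb = new StringBuilder("join(");
        int runStart = -1;
        int prev = -1;
        for (int p : all) {
            if (prev != -1 && p == prev + 1) { // still inside the current run
                prev = p;
                continue;
            }
            if (runStart != -1) appendRun(sb, runStart, prev);
            runStart = p;
            prev = p;
        }
        if (runStart != -1) appendRun(sb, runStart, prev);
        return sb.append(") length ").append(period).toString();
    }

    public static void main(String[] args) {
        // union(3..4 length 4, 3..1 length 3)
        System.out.println(union(3, 4, 4, 3, 1, 3)); // prints join(1, 3..4, 6..12) length 12
    }
}
```

Enumerating every position is fine for a worked example, but the cost grows
with the LCM, which is part of why this approach is a lot of work.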

Which brings me to an alternative solution:
	If operation(non-circular, circular), do the obvious
	If operation(circular length n, circular length k):
		if k == n
			do the obvious
		else
			throw a new exception ("base not match", perhaps)
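
As a rough sketch, the guard in this alternative could look like the
following.  The exception name, the LINEAR marker and the convention of
passing 0 for a non-circular location are all assumptions made for
illustration; "the obvious" operation itself is left out.

```java
// Hypothetical names for illustration; not BioJava's API.
final class CircularGuardSketch {

    // Thrown when two circular locations have different circular lengths.
    static final class CircularLengthMismatchException extends RuntimeException {
        CircularLengthMismatchException(int n, int k) {
            super("circular lengths differ: " + n + " vs " + k);
        }
    }

    // Assumption: a circular length of 0 marks a non-circular location.
    static final int LINEAR = 0;

    // Call before any operation; returns normally when "the obvious" applies.
    static void requireCompatible(int lengthA, int lengthB) {
        boolean bothCircular = lengthA != LINEAR && lengthB != LINEAR;
        if (bothCircular && lengthA != lengthB) {
            throw new CircularLengthMismatchException(lengthA, lengthB);
        }
    }

    public static void main(String[] args) {
        requireCompatible(LINEAR, 40); // non-circular with circular: allowed
        requireCompatible(50, 50);     // same circular length: allowed
        requireCompatible(50, 40);     // throws CircularLengthMismatchException
    }
}
```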

I apologize for the length of this; perhaps it should be named, "All you
never wanted to know about decorated locations but were too smart to ask."
Regardless, it's important and non-trivial so I'd like to make sure that
what I'm designing is useful.

Greg