[Biojava-dev] gaps and holes

Wed Oct 26 07:20:40 EDT 2005

Hi,

I understand there's been a thread about the biojava gaps model. For those of 
you who want a not very good explanation, you track down my PhD on the Sanger 
web site and scan through chapter 3. If my memory serves, there is a section 
in there about the symbol group theory. Anyhoo - here is the long-and-short 
of the problem...

The BioJava symbol model was designed to support algorithms. This lead us to 
go a set-theoretic route for modelling the DNA/RNA/Protein/(insert biopolymer 
here). This is visible in two places.

1) ambiguity is modelled using sets

  * The base nucleotides a/g/c/t are interconvertible with the sets {a}, {g}, 
{c} and {t}

  * Ambiguities are interconvertible with the sets that they range over e.g. n 
-> {a,g,c,t}

  * If you take the power set of DNA, you naturally have the bases, N and {} - 
and as if by magic, we also need gaps, so gap -> {} seems like an obvious 
rule

2) columns of an alignment are modelled as elements of cross-products of 
symbol sets

  * an alignment between two DNA sequences is a string from the alphabet DNA x 
DNA, which by convention in BioJava is a string over Pow({a,g,c,t}) ^2

  * for a symbol to be a member of DNA^2, it must be a cross-product symbol of 
dimension 2, where each component s_1 and s_2 are from the DNA alphabet, 
written [s_1, s_2].

OK - so, this all sounds fairly reasonable so far. Now for the more anoying 
bits.

a) what happens if we have an ambiguous symbol in an alignment?

  * Well, let's say we have the symbols s_1 and s_2, and s_1 is ambiguous - 
that is, it maps to a set of symbols that does not have size=1. This can be 
displayed as a column in an alignment just fine. E.g. the column [n,a] can be 
expanded to [{a,g,c,t},{a}] and this in turn can be expanded to {[a,a], 
[g,a], [c,a], [t,a]}.

  * Just for the fun of it, let's take {[a,g], [g,a]} - we can't write this 
down in the form [{i,j,...},{x,y,...}], so it can not be the column of an 
alignment. The symbols that can be a column in an alignment are basis 
symbols. Those that can not are just Symbol instances. Every basis symbol 
could be used as a basis function in a probability distribution. Every 
single-dimensional symbol is a basis symbol, but we tend to be even more 
specific and call these atomic symbols.

b) Specifically, what happens if we have gaps in an alignment?

  * Let's have ~ for {} - you'll see why in a moment...

  * Let's have DNA^2

  * We could write [~, a] to represent a column in an alignment where there 
was a gap in the 1st sequence and a in the second. Similarly, [~,~] could 
represent a gap in both.

  * It follows by analogy that we would use [~] for the one-dimensional case. 
If we push this 1-dimensional case notation back up to the 2d one, we get the 
unweildy result [[~],[~]]

  * Let's clean up the notation by keeping ~ -> {} and adding - -> [~]. Now we 
can write [-,-] for the 2d case, - for the 1d case and ~ for the 0d case.

c) Wait a minute - the 0d case???

  * Consider the empty alphabet. It is defined by the symbol set {}, so it 
contains ~ only. Now let's say there's a finite-state machine that is 
generating symbol lists. If the machine can never reach part of the symbol 
list to generate it, then the symbol there is ~

  * So - if you have a DNA sequence generated by a FSM, then the portion of 
the tape before and after the generated symbol list is populated by ~, which 
is kind of nice notationally  because this is what multi-fasta uses to pad 
out before & after sequences in alignments

d) Back to alignments, from a FSM-centric point of view

  * Now we can use - to represent the case when the FSM advanced through a 
state silently. That is, the FSM moved on one state, but the emitted symbol 
list did not. If we choose to capture this 'emission slippage', then we need 
to notate that nothing was emitted but that it took up one symbol's space in 
the symbol list because one state was advanced. Hence, [~] is a reasonable 
choice here.

  * In pair-wise alignment, we can now use [-,x] and [x,-] to represent the 
case where the FSM emitted nothing on the first or second tape, respectively. 
[-,-] would represent the case where the FSM emitted on neither tape but 
still advanced a state. ~ can still be used for the case when the FSM could 
never generate a symbol, for example, outside the alignment matrix.

I hope this has made part of the rationalle for structured gaps more clear. I 
agree that it is a bit strange, but if you want a consistent structure for 
representing symbols as sets and for representing alignments, it prety much 
drops out as The One True Way. We can split hairs about exactly when ~ and - 
get uses, but they are different things, and if you confuse the two then 
inside things like DP recursions, Very Bad Things happen which require 
boundary conditions and nasty hacks to correct. Perhaps we need to use a 
GapSymbol interface or have isGap on Symbol or something to make life easier. 
It's a pitty the Java type system plays so badly with sets. Pitty ML isn't 
generally accepted as being a useable language :-(

Matthew