[Biojava-dev] gaps and holes
Matthew Pocock
matthew.pocock at ncl.ac.uk
Wed Oct 26 07:20:40 EDT 2005
Hi,
I understand there's been a thread about the biojava gaps model. For those of
you who want a not very good explanation, you track down my PhD on the Sanger
web site and scan through chapter 3. If my memory serves, there is a section
in there about the symbol group theory. Anyhoo - here is the long-and-short
of the problem...
The BioJava symbol model was designed to support algorithms. This lead us to
go a set-theoretic route for modelling the DNA/RNA/Protein/(insert biopolymer
here). This is visible in two places.
1) ambiguity is modelled using sets
* The base nucleotides a/g/c/t are interconvertible with the sets {a}, {g},
{c} and {t}
* Ambiguities are interconvertible with the sets that they range over e.g. n
-> {a,g,c,t}
* If you take the power set of DNA, you naturally have the bases, N and {} -
and as if by magic, we also need gaps, so gap -> {} seems like an obvious
rule
2) columns of an alignment are modelled as elements of cross-products of
symbol sets
* an alignment between two DNA sequences is a string from the alphabet DNA x
DNA, which by convention in BioJava is a string over Pow({a,g,c,t}) ^2
* for a symbol to be a member of DNA^2, it must be a cross-product symbol of
dimension 2, where each component s_1 and s_2 are from the DNA alphabet,
written [s_1, s_2].
OK - so, this all sounds fairly reasonable so far. Now for the more anoying
bits.
a) what happens if we have an ambiguous symbol in an alignment?
* Well, let's say we have the symbols s_1 and s_2, and s_1 is ambiguous -
that is, it maps to a set of symbols that does not have size=1. This can be
displayed as a column in an alignment just fine. E.g. the column [n,a] can be
expanded to [{a,g,c,t},{a}] and this in turn can be expanded to {[a,a],
[g,a], [c,a], [t,a]}.
* Just for the fun of it, let's take {[a,g], [g,a]} - we can't write this
down in the form [{i,j,...},{x,y,...}], so it can not be the column of an
alignment. The symbols that can be a column in an alignment are basis
symbols. Those that can not are just Symbol instances. Every basis symbol
could be used as a basis function in a probability distribution. Every
single-dimensional symbol is a basis symbol, but we tend to be even more
specific and call these atomic symbols.
b) Specifically, what happens if we have gaps in an alignment?
* Let's have ~ for {} - you'll see why in a moment...
* Let's have DNA^2
* We could write [~, a] to represent a column in an alignment where there
was a gap in the 1st sequence and a in the second. Similarly, [~,~] could
represent a gap in both.
* It follows by analogy that we would use [~] for the one-dimensional case.
If we push this 1-dimensional case notation back up to the 2d one, we get the
unweildy result [[~],[~]]
* Let's clean up the notation by keeping ~ -> {} and adding - -> [~]. Now we
can write [-,-] for the 2d case, - for the 1d case and ~ for the 0d case.
c) Wait a minute - the 0d case???
* Consider the empty alphabet. It is defined by the symbol set {}, so it
contains ~ only. Now let's say there's a finite-state machine that is
generating symbol lists. If the machine can never reach part of the symbol
list to generate it, then the symbol there is ~
* So - if you have a DNA sequence generated by a FSM, then the portion of
the tape before and after the generated symbol list is populated by ~, which
is kind of nice notationally because this is what multi-fasta uses to pad
out before & after sequences in alignments
d) Back to alignments, from a FSM-centric point of view
* Now we can use - to represent the case when the FSM advanced through a
state silently. That is, the FSM moved on one state, but the emitted symbol
list did not. If we choose to capture this 'emission slippage', then we need
to notate that nothing was emitted but that it took up one symbol's space in
the symbol list because one state was advanced. Hence, [~] is a reasonable
choice here.
* In pair-wise alignment, we can now use [-,x] and [x,-] to represent the
case where the FSM emitted nothing on the first or second tape, respectively.
[-,-] would represent the case where the FSM emitted on neither tape but
still advanced a state. ~ can still be used for the case when the FSM could
never generate a symbol, for example, outside the alignment matrix.
I hope this has made part of the rationalle for structured gaps more clear. I
agree that it is a bit strange, but if you want a consistent structure for
representing symbols as sets and for representing alignments, it prety much
drops out as The One True Way. We can split hairs about exactly when ~ and -
get uses, but they are different things, and if you confuse the two then
inside things like DP recursions, Very Bad Things happen which require
boundary conditions and nasty hacks to correct. Perhaps we need to use a
GapSymbol interface or have isGap on Symbol or something to make life easier.
It's a pitty the Java type system plays so badly with sets. Pitty ML isn't
generally accepted as being a useable language :-(
Matthew
More information about the biojava-dev
mailing list