[Biojava-l] Moving DP code & changing alignments

Matthew Pocock mrp@sanger.ac.uk
Thu, 24 Feb 2000 19:16:08 +0000


Dear all,

We have thoroughly hammered the Embl/Genbank & GFF code as well as the
feature creation magic. I have a realy cool script that eats an embl
entry & spits out a fasta file and a gff file. It needed supprisingly
little code. So - as far as I am concerned, this  is ready for general
use.

A time has gone by, the code-base has grown & we have got into a bit of
a mess. This doesn't impact upon the core interfaces like sequence &
feature, but does affect the more esoteric stuff.

The DP code is all muddled up with the concepts of alignments which is
all confused with weight-matricies. seq contains the cross-product
alphabets & residues which look a bit daunting at first sight. Also, the
alignment interface is fairly creaky. I think that we should move all of
the DP stuff into a DP package, and possibly move the alignment
interface & simple impl into seq.compound along with the cross-product
alphabet & residues.

Weight matricies would move with the DP code. Suffix trees would go into
seq.tools - they are realy usefull things to have arround, but seq
should be reserved for the core interfaces/classes.

If you have any views on this, say so. I will probably attack this
Monday. It needs sorting before any kind of release.

As an asside, for those of you who care I will explain the cross-product
stuff, and how it is rrrrrealy useful. If you don't care, then don't
read on :->

Matthew

Why I think that Cross-product Alphabets & Residues are a good idea.

First off, the day-to-day tasks of a biojava user will probably never
entail the use of these things directly. They are mathematical
constructions that let implementations generalise where you know
intuitively it should, but where different things are going on in the
API.

It is realy nice to define a ResidueList as a list of Residues from an
Alphabet. Most of the time this will be DNA, RNA or Protein, which is
why we have extra-special support for them. By defining a suitable
alphabet, we can handle lists of anything in a very safe way, and then
add features or leverage the whole of the rest of BioJava (GUIs, DP,
SVMs). So - you could read in stock-market values to a ResidueList with
an alphabet over currency, and do DP to find interesting shapes, or
compair two stocks using Smith-Waterman - or - load in an MP3 using a
sound alphabet & compair them to a profile HMM to see what genera of
music it belongs to. (Go on - I dair you - I have experimented with coin
and dice alphabets to try out the examples in Biological Sequence
Analysis Durbin et al '98).

We can define a cross product between two alphabets - here is an
example:

DNA X DNA = All_Dinucleotides

{a, g, c, t} X {a, g, c, t} =

      { aa, ag, ac, at,
        ga, gg, gc, gt,
        ca, cg, cc, ct,
        ta, tg, tc, tt }

In this case, each of the resulting Residues is a combination of two
residues from the left-hand-side. Rather than constructing lots of
complex objects, we just model this using the CrossProductResidue which
contains internaly a list of residues (in this case two DNA residues).
In the cases with lots of possibilities (e.g. DNA ^ 200) the alphabet
could be fruigal about what objects it acctualy instantiated!

We can combine two different alphabets e.g. Dice role and Coin toss like
this:

Dice X Coin = DiceCoin
{1, 2, 3, 4, 5, 6} X {h, t} =
      { 1h, 2h, 3h, 4h, 5h, 6h,
        1t, 2t, 3t, 4t, 5t, 6t }

This gives us all possible combinations of dice roles and coin tosses.
Importantly, the residues maintain their orders - 1h is different to h1.
This may look a bit irrelivant to informatics, but bare with me...

Using the DNA X DNA alphabet, we can take a DNA sequence and construct a
view of it as a list of overlapping dinucleotides. We can then take this
dinucleotide list and do all the normal things to it - learn an HMM for
spotting motifs, or find high-information areas or whatever. This is
quite cool in its own right.

An entire alignment has a fixed length that each sequence is fitted to -
usualy by adding gaps. If we take a single column, it can be viewed as a
list of residues (probably nucleotides or amino-acids). The whole
alignment could be seen as a list of columns. In the beginning we
defined a ResidueList to be a list of residues, where the residues are
members of an alphabet. The cross-product alphabets show us how to make
residues that store an ordered list of other residues. So - an alignment
can be re-cast as a special case of a ResidueList over a cross-product
alphabet. Phew.

Of course, there will be a simple view of one of these beasts to make it
look realy prety & like a grid. And - we will provide API to add/remove
gaps & things like that.

Since the alphabets being combined don't have to be of the same type, we
can add a slot for stooring a state-path, or some probability score, or
labelings (seccondary structure, exon/intron boundaries, all that stuff)
while still letting it be an alignment.

E.g. if you process a protein with a profile HMM, it would produce an
alignment object that used a ResidueList over the alphabet:

{Protein_alphabet, gap} X {profile states} X {probability}

And the alignment looks like:

(glu    )   (asp    )
(match_1) . (match_2) ...
(0.9    )   (0.8    )

Also, magicaly, any one of the ResidueLists contained in an alignment
could be over a cross-product alphabet - we can put alignments inside an
alignment - build up trees & phylogenies.

There are other realy usefull properties of these things when we start
playing with combining the states in different HMMs, but that is another
story...