[Biojava-dev] Comments about OrderNDistributions

Francois Pepin fpepin at cs.mcgill.ca
Mon Mar 3 20:06:16 EST 2003


After going through the code for the OrderNDistributions, I have a
couple of comments and questions:

Is there any reason why the conditional probabilities instead of joint
probabilities are used there?

Right now, OrderNDistribution.getWeight(cgt) (or any other codon) gives
P(t|cg), while getting P(cgt) would be a lot more useful. It's quite easy
to go from the joint to the conditional probabilities, while going in the
opposite direction is pretty troublesome.
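
As a rough illustration of the "easy" direction, here is a minimal sketch
in plain Java. It does not use the BioJava API: the class name, the
conditional() helper and the map of made-up joint weights are all
hypothetical, and it assumes the joint table holds one entry per codon.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-alone sketch; not part of BioJava.
class JointToConditional {
    // P(last | prefix) = P(prefix + last) / sum over x of P(prefix + x)
    static double conditional(Map<String, Double> joint, String kmer) {
        String prefix = kmer.substring(0, kmer.length() - 1);
        double prefixMass = 0.0;
        for (Map.Entry<String, Double> e : joint.entrySet()) {
            String key = e.getKey();
            if (key.length() == kmer.length() && key.startsWith(prefix)) {
                prefixMass += e.getValue();
            }
        }
        Double p = joint.get(kmer);
        if (p == null || prefixMass == 0.0) {
            return 0.0; // unseen codon or unseen prefix
        }
        return p / prefixMass;
    }

    public static void main(String[] args) {
        // Made-up joint weights, for illustration only (they sum to 1).
        Map<String, Double> joint = new HashMap<>();
        joint.put("cga", 0.125);
        joint.put("cgc", 0.0625);
        joint.put("cgg", 0.0625);
        joint.put("cgt", 0.25);
        joint.put("aaa", 0.5);
        System.out.println(conditional(joint, "cgt")); // prints 0.5
    }
}
```

One pass over the joint table is enough here, which is the asymmetry I
mean: nothing comparable works when only the conditional weights are
stored.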

To get P(cgt), one would need P(t|cg)*P(cg), where P(cg) is the sum over
n of P(g|nc)*P(nc), that is
P(g|ac)*P(ac)+P(g|cc)*P(cc)+P(g|gc)*P(gc)+P(g|tc)*P(tc), and each P(nc)
then has to be recovered in the same way.

I don't really see why we don't store joint probabilities instead, which
would avoid having to worry about the conditioning and conditioned
alphabets there.

Also, I'm not positive about this, but it seems that some information
would be lost (or at least be quite difficult to recover) about the first
few characters of the sequence. For example, with AACCCGGG, no A's would
appear anywhere in a 3rd order distribution (which would really be a 2nd
order Markov chain). Two ways of getting around this would be either to
carry all of the distributions of lower order to make sure that the data
is around, which is a bit cumbersome, or to have
SymbolListViews.orderNSymbolList(AACCCGGG, 3) give out NNAACCCGGG in
this case, and have the OrderNDistributions take that into account.
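
To make the padding idea concrete, here is a small sketch on plain
strings. It does not call SymbolListViews or any other BioJava class; the
PaddedWindows class, the windows() helper and the use of 'N' as the
padding character are assumptions made only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch; not part of BioJava.
class PaddedWindows {
    // Returns every overlapping window of length `order` over `seq`.
    // With pad == true, (order - 1) 'N' characters are prepended first.
    static List<String> windows(String seq, int order, boolean pad) {
        StringBuilder sb = new StringBuilder();
        if (pad) {
            for (int i = 0; i < order - 1; i++) {
                sb.append('N');
            }
        }
        sb.append(seq);
        String s = sb.toString();
        List<String> out = new ArrayList<>();
        for (int i = 0; i + order <= s.length(); i++) {
            out.add(s.substring(i, i + order));
        }
        return out;
    }

    public static void main(String[] args) {
        // Without padding: 6 windows, starting at AAC.
        System.out.println(windows("AACCCGGG", 3, false));
        // With padding: 8 windows, starting at NNA and NAA.
        System.out.println(windows("AACCCGGG", 3, true));
    }
}
```

Without padding, the last symbol of every window is a C or a G; with
padding, the two extra windows NNA and NAA end in the leading A's.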

What do people think about this?

Francois Pepin



