[Biojava-l]

Mon May 5 23:47:10 EDT 2003

Once upon a time, on a computer far far away, Ren, Zhen wrote:
> Hi,
> 
> This message refers to the WeightMatrixDemo example at "BioJava In Anger" page again (http://bioconf.otago.ac.nz/biojava/weightMatrix.htm).  I am trying to understand the null model behind this.  Experts out there, please help!
> 
> I slightly modified the code as below (DNA version):
>
> [snip]
>
> I understand 0.25 is 1/4 for 4 nucleotides.  However, let's look at 
> the protein version:
>
> [snip]
>
> Now, 0.045454545454545456 is 1/22.  Can anyone tell me why 22 here?

This does actually make sense.  The protein alphabet in BioJava
contains 22 symbols: the 20 standard amino acids, plus selenocysteine
and a magical symbol called TERM which ideally shouldn't be there,
but is really helpful when making things like codon -> amino
acid translation tables.
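
To spell out the arithmetic, here is a minimal plain-Java sketch.  It
does not call BioJava at all; the class and method names are made up
for illustration, and the only assumption is that a uniform
distribution gives every symbol a weight of 1 / (alphabet size).

```java
class UniformNullSketch {
    // Under a uniform distribution every symbol gets weight 1 / (alphabet size).
    static double uniformWeight(int alphabetSize) {
        if (alphabetSize <= 0) {
            throw new IllegalArgumentException("alphabet must be non-empty");
        }
        return 1.0 / alphabetSize;
    }

    public static void main(String[] args) {
        // 4 nucleotides in the DNA alphabet
        System.out.println("DNA (4 symbols):      " + uniformWeight(4));
        // 20 standard amino acids + selenocysteine + TERM
        System.out.println("protein (22 symbols): " + uniformWeight(22));
    }
}
```

The second value is 1/22, which is where the number in your protein
output comes from.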

The nullModel property of the standard distribution objects
defaults to a UniformDistribution over the appropriate alphabet.
For many (but by no means all) applications involving DNA, it
just happens that a uniform distribution *is* a fairly reasonable
background model.  For proteins, that's never true, so if you
are using the nullModel (for instance, if you're using the
DP toolkit in log-odds scoring mode), then you really ought to
set it to something more sensible.  Just parsing through
Swiss-Prot and counting the overall amino acid usage
would at least be vaguely sane.
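
As a rough illustration of the counting idea, here is a plain-Java
sketch.  It deliberately avoids the BioJava API: the class name, the
method names, the two toy sequences and the 0.25 column probability
are all hypothetical, and a real background would come from a full
database rather than two short strings.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class BackgroundSketch {
    // Count how often each residue letter occurs, then normalise to frequencies.
    static Map<Character, Double> residueFrequencies(List<String> sequences) {
        Map<Character, Double> freqs = new TreeMap<>();
        long total = 0;
        for (String seq : sequences) {
            for (char c : seq.toCharArray()) {
                freqs.merge(c, 1.0, Double::sum);
                total++;
            }
        }
        if (total == 0) {
            throw new IllegalArgumentException("no residues to count");
        }
        final double n = total;
        freqs.replaceAll((residue, count) -> count / n);
        return freqs;
    }

    // Natural-log odds of a model probability against a background probability.
    static double logOdds(double p, double background) {
        return Math.log(p / background);
    }

    public static void main(String[] args) {
        // Toy stand-in for a real protein database (hypothetical data).
        List<String> toy = Arrays.asList("MKLLAAGL", "MALLKAGG");
        Map<Character, Double> background = residueFrequencies(toy);

        double pLeu = 0.25; // hypothetical weight-matrix probability for L
        System.out.println("vs uniform null: " + logOdds(pLeu, 1.0 / 22));
        System.out.println("vs counted null: " + logOdds(pLeu, background.get('L')));
    }
}
```

The same column probability scores quite differently depending on
which background you divide by, which is why the choice of null
model matters as soon as you score in log-odds mode.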

(Of course, many users of the Distribution objects don't touch
the nullModel at all.  It really only matters if you want
odds, or log-odds, rather than plain probabilities.)

Hope this helps,

     Thomas.