[Biojava-l] Null Model

Thomas Down thomas at derkholm.net
Tue May 6 09:46:56 EDT 2003


Once upon a time, on a computer far far away, Ren, Zhen wrote:
> Thank you so much for the reply.  BioJava is such a valuable resource. 
> Since I am learning it, I have to ask more questions in order to 
> understand them:
> 
> 1) Notice in the protein version code in my previous email, I used ProteinTools.getAlphabet() method instead of ProteinTools.getTAlphabet().  Actually when you try both, you will get the same output.  I expected 1/21 for the former alphabet since it doesn't really include the TERM magic symbol.

The Distributions in your example program are actually being
created by DistributionTools.distributionOverAlignment, based
on the alphabet of the sequences in the alignment you use.  These
sequences are created by ProteinTools.createProtein, which always
uses the PROTEIN-TERM alphabet.  If you force the use of the
plain PROTEIN alphabet:

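    // Parse the string against plain PROTEIN rather than PROTEIN-TERM.
    // getTokenization can throw BioException, and the constructor can
    // throw IllegalSymbolException, so wrap this in a try/catch.
    SymbolList sl = new SimpleSymbolList(
        ProteinTools.getAlphabet().getTokenization("token"),
        "alvaa"
    );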

you'll get the expected result of 1/21, since the plain PROTEIN
alphabet holds the 20 standard amino acids plus SEC.

> 2) The magic symbol TERM is really helpful when making things like codon -> amino acid translation tables.  However, this is not needed when making a null model.  How can I make 1/20 for the most common 20 amino acids and 0 for both SEC and TERM symbols as the null weight?  Is there a way to delete those two symbols or do I have to create my own alphabet?

You could create your own alphabet, but a neater solution
would just be to fix the null model:

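    // setWeight and setNullModel declare checked exceptions, so this
    // needs to sit inside a try/catch (or a method that rethrows them).
    FiniteAlphabet alpha = ProteinTools.getTAlphabet();
    Distribution nullModel = new SimpleDistribution(alpha);
    for (Iterator i = alpha.iterator(); i.hasNext(); ) {
        Symbol s = (Symbol) i.next();
        if (s.getName().equals("SEC") || s.getName().equals("TER")) {
            // Selenocysteine and the terminator get no background weight.
            nullModel.setWeight(s, 0.0);
        } else {
            // The 20 standard amino acids share the weight equally.
            nullModel.setWeight(s, 1.0 / 20.0);
        }
    }
    // Replace the default (uniform over all 22 symbols) null model.
    someOtherDistribution.setNullModel(nullModel);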


[This is just illustrative: I still maintain that if you're going
to use a null model at all, the uniform distribution is not the
right way to go for protein work.  Count amino acids in Swissprot
or something to get a more reasonable background distribution
for protein sequences].
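
As a rough sketch of the counting step, here is one way to turn a set of
protein sequences into background frequencies.  This is plain Java with no
BioJava calls: the class name BackgroundFrequencies, the method estimate,
and the choice to skip anything outside the 20 standard one-letter codes
are all assumptions made for illustration, and reading the sequences out
of a Swissprot file is left out entirely.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of BioJava.
public class BackgroundFrequencies {
    // One-letter codes of the 20 standard amino acids.
    static final String STANDARD = "ACDEFGHIKLMNPQRSTVWY";

    // Counts standard residues over all sequences and normalises the
    // counts so that the returned frequencies sum to 1.
    static Map<Character, Double> estimate(List<String> sequences) {
        long[] counts = new long[STANDARD.length()];
        long total = 0;
        for (String seq : sequences) {
            for (char c : seq.toUpperCase().toCharArray()) {
                int idx = STANDARD.indexOf(c);
                if (idx >= 0) {          // ignore X, *, gaps and so on
                    counts[idx]++;
                    total++;
                }
            }
        }
        if (total == 0) {
            throw new IllegalArgumentException("no standard residues found");
        }
        Map<Character, Double> freqs = new TreeMap<Character, Double>();
        for (int i = 0; i < counts.length; i++) {
            freqs.put(STANDARD.charAt(i), (double) counts[i] / total);
        }
        return freqs;
    }

    public static void main(String[] args) {
        // Toy input; a real background would use a large database.
        Map<Character, Double> freqs = estimate(Arrays.asList("ALVAA", "MKTLLV"));
        for (Map.Entry<Character, Double> e : freqs.entrySet()) {
            System.out.printf("%c %.4f%n", e.getKey(), e.getValue());
        }
    }
}
```

Each frequency could then be handed to nullModel.setWeight for the matching
symbol in place of the flat 1.0 / 20.0 in the loop above, still leaving SEC
and TER at 0.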

> 3) In Thomas's email, he mentioned "...For proteins, that's never true, so if you are using the nullModel stuff (for instance, if you're using the DP toolkit in log-odds scoring mode), then you really ought to set it to something more sensible.  Just parsing through swissprot and counting the overall amino acid usage in that would at least be vaguely sane.  It really only matters if you want odds rather than probabilities..."  This is absolutely true and I totally agree.  However, I am not really concerned about what kind of null model I should use.  I'd like to know how I set up my null model which is different from the default one using 1/22 as the null weight.  The simplest example is to change from 1/22 to 1/20.  This is the first part of the question.

Okay, see above.


I'll try and answer your second question later,

    Thomas.

