[Biojava-l] Null Model

Ren, Zhen zren at amylin.com
Tue May 6 10:22:17 EDT 2003


Thanks again, Thomas.  These answers sound really good.  As I said earlier, I completely agree with you on counting amino acids in Swissprot or something to get a more reasonable background distribution for protein sequences.  However, let me take another example.  Here is a link to Homo sapiens amino acid composition:
http://www.ebi.ac.uk/proteome/index.html?http://www.ebi.ac.uk/proteome/HUMAN/structure/human_9606_amino.html

My question is how to create a null model with these frequencies, instead of using 1/20 (5.00%) as the null weight for each amino acid.  Adapting the snippet below would certainly do the job, but it is a little awkward, isn't it?

    FiniteAlphabet alpha = ProteinTools.getTAlphabet();
    Distribution nullModel = new SimpleDistribution(alpha);
    for (Iterator i = alpha.iterator(); i.hasNext(); ) {
        Symbol s = (Symbol) i.next();
        if (s.getName().equals("SEC") || s.getName().equals("TER")) {
            nullModel.setWeight(s, 0);
        } else {
            nullModel.setWeight(s, 1.0 / 20.0);
        }
    }
    someOtherDistribution.setNullModel(nullModel);
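
To make the question concrete, here is a minimal sketch of the table-driven version I have in mind, in plain Java with no BioJava calls.  The class name, the toWeights helper and the three percentages are made up for illustration (they are not the EBI figures), and I am assuming the table is keyed by the same names that Symbol.getName() returns.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of BioJava.
final class CompositionNullModel {

    /**
     * Converts a composition table (percent per symbol name) into weights
     * that sum to 1. Names listed in zeroed always get weight 0 and are
     * left out of the normalisation.
     */
    static Map<String, Double> toWeights(Map<String, Double> percentByName,
                                         String... zeroed) {
        Set<String> zero = new HashSet<>(Arrays.asList(zeroed));

        // Sum only the entries that will keep a non-zero weight.
        double total = 0.0;
        for (Map.Entry<String, Double> e : percentByName.entrySet()) {
            if (!zero.contains(e.getKey())) {
                total += e.getValue();
            }
        }
        if (total <= 0.0) {
            throw new IllegalArgumentException("table has no positive weight");
        }

        // Rescale so the kept entries sum to 1.
        Map<String, Double> weights = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : percentByName.entrySet()) {
            if (!zero.contains(e.getKey())) {
                weights.put(e.getKey(), e.getValue() / total);
            }
        }
        // Give the excluded symbols an explicit 0 so every lookup succeeds.
        for (String name : zeroed) {
            weights.put(name, 0.0);
        }
        return weights;
    }

    public static void main(String[] args) {
        // Placeholder percentages for a toy three-symbol table.
        Map<String, Double> pct = new LinkedHashMap<>();
        pct.put("ALA", 30.0);
        pct.put("LEU", 50.0);
        pct.put("TRP", 20.0);
        System.out.println(toWeights(pct, "SEC", "TER"));
    }
}
```

With such a map, the if/else in the loop above could presumably collapse to a single nullModel.setWeight(s, weights.get(s.getName())), provided every symbol name in the alphabet has an entry in the table.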

The reason I asked a series of questions about null models and log-odds is that I assumed all of this is already available in BioJava and I just don't know how to use it.  I appreciate all the help from this list.
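
For reference, this is what I understand by log-odds: the log of the ratio between a model weight and the null weight for the same symbol.  The sketch below is plain Java and only illustrates the arithmetic; the class and method names are hypothetical, and the choice of base 2 is my assumption, not necessarily what BioJava uses internally.

```java
// Hypothetical illustration of log-odds scoring, not BioJava code.
final class LogOdds {

    /** Returns log2(p / q), where p is the model weight and q the null weight. */
    static double logOdds(double p, double q) {
        if (p < 0.0) {
            throw new IllegalArgumentException("model weight must not be negative");
        }
        if (q <= 0.0) {
            // A zero null weight would make the ratio undefined or infinite.
            throw new IllegalArgumentException("null weight must be positive");
        }
        return Math.log(p / q) / Math.log(2.0);
    }

    public static void main(String[] args) {
        double p = 0.10; // placeholder model weight for one amino acid

        // The same model weight scored against two different null weights.
        System.out.println("null = 1/20: " + logOdds(p, 1.0 / 20.0));
        System.out.println("null = 1/22: " + logOdds(p, 1.0 / 22.0));
    }
}
```

The point is that the score for a symbol depends on the null weight as much as on the model weight, which is why the choice between 1/22, 1/20 and a measured composition matters to me.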

Zhen

-----Original Message-----
From: Thomas Down [mailto:thomas at derkholm.net]
Sent: Tuesday, May 06, 2003 12:47 AM
To: biojava-l at biojava.org
Subject: Re: [Biojava-l] Null Model


Once upon a time, on a computer far far away, Ren, Zhen wrote:
> Thank you so much for the reply.  BioJava is such a valuable resource. 
> Since I am learning it, I have to ask more questions in order to 
> understand them:
> 
> 1) Notice in the protein version code in my previous email, I used ProteinTools.getAlphabet() method instead of ProteinTools.getTAlphabet().  Actually when you try both, you will get the same output.  I expected 1/21 for the former alphabet since it doesn't really include the TERM magic symbol.

The Distributions in your example program are actually being
created by DistributionTools.distributionOverAlignment, based
on the alphabet of the sequences in the alignment you use.  These
sequences are created by ProteinTools.createProtein, which always
uses the PROTEIN-TERM alphabet.  If you force the use of the
plain PROTEIN alphabet:

    SymbolList sl = new SimpleSymbolList(
        ProteinTools.getAlphabet().getTokenization("token"),
        "alvaa"
    );

you'll get the expected result of 1/21.

> 2) The magic symbol TERM is really helpful when making things like codon -> amino acid translation tables.  However, this is not needed when making a null model.  How can I make 1/20 for the most common 20 amino acids and 0 for both SEC and TERM symbols as the null weight?  Is there a way to delete those two symbols or do I have to create my own alphabet?

You could create your own alphabet, but a neater solution
would just be to fix the null model:

    FiniteAlphabet alpha = ProteinTools.getTAlphabet();
    Distribution nullModel = new SimpleDistribution(alpha);
    for (Iterator i = alpha.iterator(); i.hasNext(); ) {
        Symbol s = (Symbol) i.next();
        if (s.getName().equals("SEC") || s.getName().equals("TER")) {
            nullModel.setWeight(s, 0);
        } else {
            nullModel.setWeight(s, 1.0 / 20.0);
        }
    }
    someOtherDistribution.setNullModel(nullModel);


[This is just illustrative: I still maintain that if you're going
to use a null model at all, the uniform distribution is not the
right way to go for protein work.  Count amino acids in Swissprot
or something to get a more reasonable background distribution
for protein sequences].
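
The counting itself is simple.  A rough sketch in plain Java, assuming the sequences have already been read into one-letter strings (parsing a Swissprot flat file is not shown); the class name and the frequencies helper are hypothetical, not BioJava API:

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for estimating a background distribution by counting.
final class BackgroundCounter {

    // The 20 standard amino acids, one-letter codes.
    private static final String AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY";

    /** Returns the relative frequency of each standard amino acid. */
    static Map<Character, Double> frequencies(Iterable<String> sequences) {
        long[] counts = new long[AMINO_ACIDS.length()];
        long total = 0;
        for (String seq : sequences) {
            for (int i = 0; i < seq.length(); i++) {
                int idx = AMINO_ACIDS.indexOf(Character.toUpperCase(seq.charAt(i)));
                if (idx >= 0) {
                    // Anything else (X, B, Z, U, '*', ...) is skipped.
                    counts[idx]++;
                    total++;
                }
            }
        }
        if (total == 0) {
            throw new IllegalArgumentException("no standard residues found");
        }
        Map<Character, Double> freq = new TreeMap<>();
        for (int i = 0; i < counts.length; i++) {
            freq.put(AMINO_ACIDS.charAt(i), (double) counts[i] / total);
        }
        return freq;
    }

    public static void main(String[] args) {
        // Toy input only; a real background needs a large database.
        System.out.println(frequencies(Arrays.asList("ALVAA", "MKV")));
    }
}
```

On a toy input most residues come out at exactly 0, which would be a poor null weight; with a database-sized sample every standard amino acid should be seen many times.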

> 3) In Thomas's email, he mentioned "...For proteins, that's never true, so if you are using the nullModel stuff (for instance, if you're using the DP toolkit in log-odds scoring mode), then you really ought to set it to something more sensible.  Just parsing through swissprot and counting the overall amino acid usage in that would at least be vaguely sane.  It really only matters if you want odds rather than probabilities..."  This is absolutely true and I totally agree.  However, I am not really concerned about what kind of null model I should use.  I'd like to know how I set up my null model which is different from the default one using 1/22 as the null weight.  The simplest example is to change from 1/22 to 1/20.  This is the first part of the question.

Okay, see above.


I'll try and answer your second question later,

    Thomas.
_______________________________________________
Biojava-l mailing list  -  Biojava-l at biojava.org
http://biojava.org/mailman/listinfo/biojava-l


