[Biojava-l] Null Model

Wed May 7 00:05:57 EDT 2003

Thank you, folks.

Please review the previous emails and pay attention to the way to set up the null model.  More clearly, I collect them together here:

Initially, I asked how to change 1/22 to 1/20 as the null weight for each of 20 common amino acids and 0 for SEC and TERM. Thomas suggested:

    FiniteAlphabet alpha = ProteinTools.getTAlphabet();
    Distribution nullModel = new SimpleDistribution(alpha);
    for (Iterator i = alpha.iterator(); i.hasNext(); ) {
        Symbol s = (Symbol) i.next();
        if (s.getName().equals("SEC") || s.getName().equals("TER")) {
            nullModel.setWeight(s, 0);
        } else {
            nullModel.setWeight(s, 1.0 / 20.0);
        }
    }
    someOtherDistribution.setNullModel(nullModel);

Then, I provided a link to human amino acid composition page.  Each amino acid has a different null weight.  Thomas provided the following code:

	    public Distribution readSymDistribution(Element cons)
	        throws Exception
	    {
	            Alphabet alph = AlphabetManager.instance().alphabetForName(cons.getAttribute("alphabet"));
	            SymbolTokenization nameParser = alph.getTokenization("name");
	            Distribution dist = DistributionFactory.DEFAULT.createDistribution(alph);

	            Node chld2 = cons.getFirstChild();
	            while (chld2 != null) {
	                if (chld2 instanceof Element) {
	                    Element weight = (Element) chld2;
	                    String sName = weight.getAttribute("symbol");
	                    Symbol sym = nameParser.parseToken(sName);

	                    double w = Double.parseDouble(weight.getAttribute("weight"));
	                    try {
	                        dist.setWeight(sym, w);
	                    } catch (ChangeVetoException ex) {
	                        throw new BioError(ex);
	                    }
	                }
	                chld2 = chld2.getNextSibling();
	            }

	            return dist;
	    }

Frankly speaking, I don’t really see any difference between them.  Notice both use the method dist.setWeight(sym, weight);
My question is really not how I get those numbers into my program (by reading from a file or using XML, or hard coding them or something else).  I suspect at the moment this is probably the only way to set up a null model other than using the default null model, which is 1/22 as the null weight for all 20 amino acids plus SEC and TERM.  Disagree?

Furthermore, there is a more serious problem.  Let's go back to my modified WeightMatrixDemo program (for protein) in "BioJava in Anger".  Again, here is the code:

1	import java.util.*;
2	import org.biojava.bio.dist.*;
3	import org.biojava.bio.dp.*;
4	import org.biojava.bio.seq.*;
5	import org.biojava.bio.symbol.*;
6
7	public class WeightMatrixDemo {
8         public static void main(String[] args) throws Exception {
9	        //make an Alignment of a motif.
10	        Map map = new HashMap();
11	        map.put("seq0", ProteinTools.createProtein("aggag"));
12	        map.put("seq1", ProteinTools.createProtein("aggaa"));
13	        map.put("seq2", ProteinTools.createProtein("aggag"));
14	        map.put("seq3", ProteinTools.createProtein("aagag"));
15	        Alignment align = new SimpleAlignment(map);
16
17	        //make a Distribution[] of the motif
18	        Distribution[] dists = DistributionTools.distOverAlignment(align, false, 0.01);
19
20	        //make a Weight Matrix
21	        WeightMatrix matrix = new SimpleWeightMatrix(dists);
22
23	        //the sequence to score against
24	        Sequence seq = ProteinTools.createProteinSequence("aaagcctaggaagaggagctgat","seq");
25
26	        //annotate the sequence with the weight matrix using a low threshold (0.1)
27            WeightMatrixAnnotator wma = new WeightMatrixAnnotator(matrix, 0.1);
28            seq = wma.annotate(seq);
29
30            //output match information
31            for(Iterator it = seq.features(); it.hasNext(); ) {
32                Feature f = (Feature)it.next();
33		      Location loc = f.getLocation();
34	            System.out.println("Match at " + loc.getMin()+"-"+loc.getMax());
35	            System.out.println("\tscore : "+f.getAnnotation().getProperty("score"));
36            }
37	    }
38	}

Notice on line 18,  DistributionTools.distOverAlignment() returns an array of Distribution over the alignment.  When calling this method, the default null model is already used for each Distribution.  Later using dist.setWeight(symbol, weight) doesn't really matter.  Or you do dist.setWeight(symbol, weight) before calling DistributionTools.distOverAlignment().  It doesn't matter either because DistributionTools doesn't take user-defined null models.  So, setting up a user-defined null model must be done within the DistributionTools.distOverAlignment() method.  After looking into the code inside DistributionTools.distOverAlignment(), the following lines:
	Distribution nullmodel = pos[i].getNullModel();
	nullmodel.setWeight(symbol, weight);
need to be added before this line:
	pos[i] = DistributionFactory.DEFAULT.createDistribution(alpha);

Hopefully, you see what I am saying here.  So, DistributionTools.java really needs to add the capacity of defining a non-default null model to do the distribution over an alignment.  Otherwise, the use is very limited.  Your thoughts?

Thank you very much.

Zhen