[Biojava-l] Null Model

Schreiber, Mark mark.schreiber at agresearch.co.nz
Wed May 7 21:47:14 EDT 2003


Hi -
 
Agreed the DistributionTools.distOverAlignment does force you to use the default (uniform) null model. It is really there as a convenience method. It would not be a mjor difficulty to add another method to allow the use of a user specified null model.
 
However, as it is just a convenience method, you could produce the same effect by taking the code from the body of that method and placing it in your program, modifying it slightly to use your specified null model.
 
Much of BioJava now exists on these two levels. The methods of the tools classes represent what people commonly do. They call the underlying and very flexible biojava APIs. They were generated to help people get over the biojava learning curve and to encapsulate code for common tasks. You should not feel restricted by what the tools classes can do. If they don't do the custom task you want then have a look at their internal code. It should give you some good insights into how to get started on your custom tasks. Looking at the code in the "Tools" methods gives a good insight into how biojava works on the inside.
 
- Mark
 

	-----Original Message----- 
	From: Ren, Zhen [mailto:zren at amylin.com] 
	Sent: Wed 7/05/2003 6:05 p.m. 
	To: Schreiber, Mark; Thomas Down; biojava-l at biojava.org 
	Cc: 
	Subject: RE: [Biojava-l] Null Model
	
	

	Thank you, folks.
	
	Please review the previous emails and pay attention to the way to set up the null model.  More clearly, I collect them together here:
	
	Initially, I asked how to change 1/22 to 1/20 as the null weight for each of 20 common amino acids and 0 for SEC and TERM. Thomas suggested:
	
	    FiniteAlphabet alpha = ProteinTools.getTAlphabet();
	    Distribution nullModel = new SimpleDistribution(alpha);
	    for (Iterator i = alpha.iterator(); i.hasNext(); ) {
	        Symbol s = (Symbol) i.next();
	        if (s.getName().equals("SEC") || s.getName().equals("TER")) {
	            nullModel.setWeight(s, 0);
	        } else {
	            nullModel.setWeight(s, 1.0 / 20.0);
	        }
	    }
	    someOtherDistribution.setNullModel(nullModel);
	
	Then, I provided a link to human amino acid composition page.  Each amino acid has a different null weight.  Thomas provided the following code:
	
	            public Distribution readSymDistribution(Element cons)
	                throws Exception
	            {
	                    Alphabet alph = AlphabetManager.instance().alphabetForName(cons.getAttribute("alphabet"));
	                    SymbolTokenization nameParser = alph.getTokenization("name");
	                    Distribution dist = DistributionFactory.DEFAULT.createDistribution(alph);
	       
	                    Node chld2 = cons.getFirstChild();
	                    while (chld2 != null) {
	                        if (chld2 instanceof Element) {
	                            Element weight = (Element) chld2;
	                            String sName = weight.getAttribute("symbol");
	                            Symbol sym = nameParser.parseToken(sName);
	                          
	                            double w = Double.parseDouble(weight.getAttribute("weight"));
	                            try {
	                                dist.setWeight(sym, w);
	                            } catch (ChangeVetoException ex) {
	                                throw new BioError(ex);
	                            }
	                        }
	                        chld2 = chld2.getNextSibling();
	                    }
	       
	                    return dist;
	            }
	
	Frankly speaking, I don’t really see any difference between them.  Notice both use the method dist.setWeight(sym, weight);
	My question is really not how I get those numbers into my program (by reading from a file or using XML, or hard coding them or something else).  I suspect at the moment this is probably the only way to set up a null model other than using the default null model, which is 1/22 as the null weight for all 20 amino acids plus SEC and TERM.  Disagree?
	
	Furthermore, there is a more serious problem.  Let's go back to my modified WeightMatrixDemo program (for protein) in "BioJava in Anger".  Again, here is the code:
	
	1       import java.util.*;
	2       import org.biojava.bio.dist.*;
	3       import org.biojava.bio.dp.*;
	4       import org.biojava.bio.seq.*;
	5       import org.biojava.bio.symbol.*;
	6
	7       public class WeightMatrixDemo {
	8         public static void main(String[] args) throws Exception {
	9               //make an Alignment of a motif.
	10              Map map = new HashMap();
	11              map.put("seq0", ProteinTools.createProtein("aggag"));
	12              map.put("seq1", ProteinTools.createProtein("aggaa"));
	13              map.put("seq2", ProteinTools.createProtein("aggag"));
	14              map.put("seq3", ProteinTools.createProtein("aagag"));
	15              Alignment align = new SimpleAlignment(map);
	16
	17              //make a Distribution[] of the motif
	18              Distribution[] dists = DistributionTools.distOverAlignment(align, false, 0.01);
	19
	20              //make a Weight Matrix
	21              WeightMatrix matrix = new SimpleWeightMatrix(dists);
	22
	23              //the sequence to score against
	24              Sequence seq = ProteinTools.createProteinSequence("aaagcctaggaagaggagctgat","seq");
	25
	26              //annotate the sequence with the weight matrix using a low threshold (0.1)
	27            WeightMatrixAnnotator wma = new WeightMatrixAnnotator(matrix, 0.1);
	28            seq = wma.annotate(seq);
	29
	30            //output match information
	31            for(Iterator it = seq.features(); it.hasNext(); ) {
	32                Feature f = (Feature)it.next();
	33                    Location loc = f.getLocation();
	34                  System.out.println("Match at " + loc.getMin()+"-"+loc.getMax());
	35                  System.out.println("\tscore : "+f.getAnnotation().getProperty("score"));
	36            }
	37          }
	38      }
	
	Notice on line 18,  DistributionTools.distOverAlignment() returns an array of Distribution over the alignment.  When calling this method, the default null model is already used for each Distribution.  Later using dist.setWeight(symbol, weight) doesn't really matter.  Or you do dist.setWeight(symbol, weight) before calling DistributionTools.distOverAlignment().  It doesn't matter either because DistributionTools doesn't take user-defined null models.  So, setting up a user-defined null model must be done within the DistributionTools.distOverAlignment() method.  After looking into the code inside DistributionTools.distOverAlignment(), the following lines:
	        Distribution nullmodel = pos[i].getNullModel();
	        nullmodel.setWeight(symbol, weight);
	need to be added before this line:
	        pos[i] = DistributionFactory.DEFAULT.createDistribution(alpha);
	
	Hopefully, you see what I am saying here.  So, DistributionTools.java really needs to add the capacity of defining a non-default null model to do the distribution over an alignment.  Otherwise, the use is very limited.  Your thoughts?
	
	Thank you very much.
	
	Zhen
	
	
	


=======================================================================
Attention: The information contained in this message and/or attachments
from AgResearch Limited is intended only for the persons or entities
to which it is addressed and may contain confidential and/or privileged
material. Any review, retransmission, dissemination or other use of, or
taking of any action in reliance upon, this information by persons or
entities other than the intended recipients is prohibited by AgResearch
Limited. If you have received this message in error, please notify the
sender immediately.
=======================================================================



More information about the Biojava-l mailing list