[Biojava-l] BaumWelchTrainer Broken??!!! (please help)
Martin Eklund
martin.eklund at farmbio.uu.se
Thu Nov 24 06:21:59 EST 2005
Hi,
I looked to Biojava to train an HMM given a FASTA alignment file and
then use the HMM to generate 'new' sequences similar to the ones in the
FASTA file. However, I run into exactly the same problem as Todd Riley
(quoted below). Also, if I - despite the training problems - try to
generate a sequence from the HMM I get an uncaught ClassCastException
according to:
Exception in thread "main" java.lang.ClassCastException:
org.biojava.bio.symbol.SimpleBasisSymbol
at org.biojava.bio.dp.DP.generate(DP.java:594)
at testHMM.main(testHMM.java:95)
Any ideas about how to resolve these issues?
Thank you!
Martin.
==============================================================
Admitidley it's been a very long time since I tried any of these. I'm
pretty sure they worked way back then?
Matthew, do you have any insights? These are your babies right?
- Mark
Todd Riley <toddri at eden.rutgers.edu>
Sent by: biojava-l-bounces at portal.open-bio.org
11/23/2005 05:03 AM
To: Mark Schreiber/GP/Novartis at PH, biojava-l at biojava.org
cc:
Subject: Re: [Biojava-l] BaumWelchTrainer Broken??!!! (please help)
I have received info (from at least 3 other people) that have had the
same problem with the BaumWelchTrainer class. All three of these
individuals eventually gave up and went elsewhere (other software) in
order to perform Baum Welch EM on their models.
There definitely is a problem with the BaumWelchTrainer class. It's
either a documentation bug or coding bug. The demos shipped in the V1.4
source (demos/dp/PatternFinder.java , demos/dp/SearchProfile.java) don't
work, and the source code from
http://www.biojava.org/docs/bj_in_anger/profileHMM.htm doesn't work (it
crashes).
If someone, who has worked with and knows how to get the
BaumWelchTrainer object to work, can test the following code (taken
almost entirely from profileHMM.htm above) on the current release (1.4),
it would be greatly appreciated.
Thanks in advance!
-Todd
Todd Riley wrote:
> *******************************************************************
> My file that contains the code from the demo profileHMM.htm found in
> "Biojava In Anger" starts here:
> *******************************************************************
>
> /*
> * DemoPHMM.java - Directly from
> http://www.biojava.org/docs/bj_in_anger/profileHMM.htm
> *
> */
>
> import java.util.*;
> import java.io.BufferedReader;
> import java.io.FileOutputStream;
> import java.io.PrintStream;
> import java.io.FileReader;
> import java.io.IOException;
> import java.util.StringTokenizer;
> import java.io.File;
> import javax.swing.JFrame;
> import java.awt.event.*;
>
> //import biojava.*;
> //import biojava.BaumWelchTrainer;
> //import biojava.TrainingAlgorithm;
> import org.biojava.bio.*;
> import org.biojava.bio.dist.*;
> import org.biojava.bio.dp.*;
> import org.biojava.bio.seq.*;
> import org.biojava.bio.seq.db.*;
> import org.biojava.bio.seq.io.*;
> import org.biojava.bio.symbol.*;
> import org.biojava.utils.*;
>
> public class DemoPHMM {
>
> public static void main(String[] args) throws IOException {
> DemoPHMM hmm = new DemoPHMM();
> hmm.letsDoThis(args);
> }
>
>
> public void letsDoThis(String[] args) throws IOException {
> if (args.length < 1 || args[0].equals("-help") ||
> args[0].equals("-?")) {
> System.out.println("\n Usage: DemoPHMM
> <Fasta-Training-Set-File>");
> System.exit(-1);
> }
>
> String trainingSet=args[0];
>
> try {
> /*
> * Make a profile HMM over the DNA Alphabet with 12 'columns'
> and default
> * DistributionFactories to construct the transition and
emmission
> * Distributions
> */
> ProfileHMM hmm = new ProfileHMM(DNATools.getDNA(),
> 20,
> DistributionFactory.DEFAULT,
> DistributionFactory.DEFAULT,
> "my profilehmm");
>
> //create the Dynamic Programming matrix for the model.
> DP dp = DPFactory.DEFAULT.createDP(hmm);
>
> //Database to hold the training set
> //SequenceDB db = new HashSequenceDB();
> //code here to load the training set
> SequenceDB db =
> IOUtility.readSequenceDB(trainingSet,DNATools.getDNA());
>
> //train the model to have uniform parameters
> ModelTrainer mt = new SimpleModelTrainer();
> //register the model to train
> mt.registerModel(hmm);
> //as no other counts are being used the null weight will cause
> everything to be uniform
> mt.setNullModelWeight(1.0);
> mt.train();
>
> //create a BW trainer for the dp matrix generated from the HMM
> BaumWelchTrainer bwt = new BaumWelchTrainer(dp);
>
> //anonymous implementation of the stopping criteria interface
> to stop after 20 iterations
> StoppingCriteria stopper = new StoppingCriteria(){
> public boolean isTrainingComplete(TrainingAlgorithm ta){
> System.out.println("\t\tCycle: " + ta.getCycle() + " score:
> " + ta.getCurrentScore() + " " + (ta.getCurrentScore() -
> ta.getLastScore()) );
> return (ta.getCycle() > 20);
> }
> };
> /*
> * optimize the dp matrix to reflect the training set in db
> using a null model
> * weight of 1.0 and the Stopping criteria defined above.
> */
> bwt.train(db,1.0,stopper);
>
> //SymbolList test = null;
> //code here to initialize the test sequence
> Sequence test =
> DNATools.createDNASequence("tacaGAACATGTCTAAGCATGCTGggga", "mySeq");
> /*
> * put the test sequence in an array, an array is used because
> for pairwise
> * alignments using an HMM there would need to be two
> SymbolLists in the
> * array
> */
> SymbolList[] sla = {(SymbolList)test};
> //decode the most likely state path and produce an 'odds' score
> StatePath path = dp.viterbi(sla, ScoreType.ODDS);
> System.out.println("Log Odds = "+path.getScore());
>
> //print state path
> for(int i = 1; i <= path.length(); i++){
> System.out.println(path.symbolAt(StatePath.STATES, i).getName());
> }
> }
> catch (Exception ex) {
> ex.printStackTrace();
> //System.err.println("symbol is "+symbol);
> //System.err.println("distribution is
> "+StringUtility.distributionToString(emissionDist));
> System.exit(-1);
> }
>
> }
>
> }
>
> *******************************************************************
> My output from running this code above starts here:
> *******************************************************************
> Cycle: 1 score: -1105.9598698420707 -Infinity
> Cycle: 2 score: -1000.3026011513825 105.65726869068817
> Cycle: 3 score: NaN NaN
> Cycle: 4 score: NaN NaN
> Cycle: 5 score: NaN NaN
> Cycle: 6 score: NaN NaN
> Cycle: 7 score: NaN NaN
> Cycle: 8 score: NaN NaN
> Cycle: 9 score: NaN NaN
> Cycle: 10 score: NaN NaN
> Cycle: 11 score: NaN NaN
> Cycle: 12 score: NaN NaN
> Cycle: 13 score: NaN NaN
> Cycle: 14 score: NaN NaN
> Cycle: 15 score: NaN NaN
> Cycle: 16 score: NaN NaN
> Cycle: 17 score: NaN NaN
> Cycle: 18 score: NaN NaN
> Cycle: 19 score: NaN NaN
> Cycle: 20 score: NaN NaN
> Cycle: 21 score: NaN NaN
> java.lang.NullPointerException
> at org.biojava.bio.dp.onehead.SingleDP.viterbi(SingleDP.java:650)
> at org.biojava.bio.dp.onehead.SingleDP.viterbi(SingleDP.java:513)
> at DemoPHMM.letsDoThis(DemoPHMM.java:103)
> at DemoPHMM.main(DemoPHMM.java:33)
>
> *******************************************************************
> My fasta training sequence file starts here:
> *******************************************************************
> >Funk_Sequence_1
> GGACATGCCCGGGCATGTT
> >Funk_Sequence_2
> GAACATGCCCGGGCATGTCT
> >Funk_Sequence_3
> GGACATGCCCGGGCATGTCG
> >Funk_Sequence_4
> GGGCATGCCCGGGCATGTCT
> >Funk_Sequence_5
> GAACATGCCCGGGCATGTCC
> >Funk_Sequence_6
> AAACATGCCCGGGCATGTTC
> >Funk_Sequence_7
> GGACATGCCCGGGCATGTCT
> >Funk_Sequence_8
> GGACATGCCCGGGCATGTCG
> >Funk_Sequence_9
> AAACATGCCCGGGCATGCCC
> >Funk_Sequence_10
> GGGCATGCCCGGGCATGTTC
> >Funk_Sequence_11
> AGACATGCCCGGGCATGTCT
> >Funk_Sequence_12
> GGACATGCCCGGGCATGTCT
> >Funk_Sequence_13
> GGACATGCCCGGGCATGCCC
> >Funk_Sequence_14
> GGACATGTCCGGACATGTTC
> >Funk_Sequence_15
> GGACATGTCCGGACATGTCT
> >Funk_Sequence_16
> AAACATGTCCGGGCATGTCC
> >Funk_Sequence_17
> GGACATGTCCGGGCATGTCT
>
> >ElnDeiry_Sequence_1
> GGGCCTGTCACAGCATGCCT
> >ElnDeiry_Sequence_2
> CTGCATGTCTAGGCAAGTCA
> >ElnDeiry_Sequence_3
> AAACATGCCCAGACTTGTCT
> >ElnDeiry_Sequence_4
> AGGCATGCCTTTGCCT
> >ElnDeiry_Sequence_5
> GGGCATGTTTAGGCAAGCTT
> >ElnDeiry_Sequence_6
> AGACATGTTATAACAAGTCA
> >ElnDeiry_Sequence_7
> TGACATGTCCCGACGTGTTT
> >ElnDeiry_Sequence_8
> AGGCATGTTCGGGCTGTCT
> >ElnDeiry_Sequence_9
> TGACTTGCCTTGACATGTTC
> >ElnDeiry_Sequence_10
> CAGCTGCCAAGGCATGCAG
> >ElnDeiry_Sequence_11
> CAACTTGTCTGGACATGTTC
> >ElnDeiry_Sequence_12
> AGACAAGCCTGGGCAGGTCC
> >ElnDeiry_Sequence_13
> AAACAAGCCCGGATGTGCCC
> >ElnDeiry_Sequence_14
> ACACTTGTCTATACCTGCCT
> >ElnDeiry_Sequence_15
> AAACATGCTTTGACATGTTC
> >ElnDeiry_Sequence_16
> GGACTTGCCCTGGCCAGCCC
> >ElnDeiry_Sequence_17
> AGGTTTGCCGGGCTTGTTC
> >ElnDeiry_Sequence_18
> TGACTTGCCCAGACATGTTT
> >ElnDeiry_Sequence_19
> AAGCATGCCTTGACTTGTTC
> >ElnDeiry_Sequence_20
> TGCCTTGCCTGGACTTGCCT
>
>
> mark.schreiber at novartis.com wrote:
>
>> Can you try the code in
>> http://www.biojava.org/docs/bj_in_anger/profileHMM.htm
>>
>> I have found in the past that you need to set some intial weights
>> before starting the BW trainer. If this example doesn't work please
>> repost to the list.
>>
>> - Mark
>>
>>
>>
>>
>>
> _______________________________________________
> Biojava-l mailing list - Biojava-l at biojava.org
> http://biojava.org/mailman/listinfo/biojava-l
_______________________________________________
Biojava-l mailing list - Biojava-l at biojava.org
http://biojava.org/mailman/listinfo/biojava-l
--
========================================
Martin Eklund
PhD Student
Department of Pharmaceutical Biosciences
Uppsala University, Sweden
Ph: +46-18-4714281
========================================
More information about the Biojava-l
mailing list