[Biojava-l] [HMM] detecting several instances of the same motif fails

Mark Schreiber markjschreiber at gmail.com
Wed May 23 01:48:32 UTC 2007


Hi -

There are two things going on here. The first is that I believe the
profile model presented in biojava doesn't loop back on itself. I
could be wrong; I need to check the code. If this is indeed the case,
then the model will not be capable of finding more than one match in a
sequence. This can be changed fairly easily, either by adapting the
existing ProfileHMM code in a custom class or by getting a reference to
the MarkovModel and changing its possible transitions.
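
To illustrate why the loop matters, here is a minimal sketch in plain
Java. It does not use the BioJava API: the class LoopingMotifSketch, the
method names, and all probabilities (0.85 for a consensus match, 0.1 for
entering the motif, a uniform background) are assumptions chosen for
illustration, and the model is simplified to match states only (no
insert or delete states). Capping the number of passes through the motif
at one stands in for a model that cannot loop back.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class LoopingMotifSketch {
    static final double LOG_BG = Math.log(0.25);   // uniform background emission (assumed)
    static final double LOG_STAY = Math.log(0.9);  // stay in the background state (assumed)
    static final double LOG_ENTER = Math.log(0.1); // enter the motif states (assumed)

    // Log-probability that the motif's match states emit seq[start .. start+len-1].
    public static double motifLogProb(String seq, int start, String motif) {
        double lp = 0.0;
        for (int j = 0; j < motif.length(); j++) {
            lp += Math.log(seq.charAt(start + j) == motif.charAt(j) ? 0.85 : 0.05);
        }
        return lp;
    }

    // Viterbi-style search: returns the 0-based start offsets of the motif
    // occurrences on the best path, allowing at most maxMotifs passes
    // through the motif states (1 = no loop back).
    public static List<Integer> bestPath(String seq, String motif, int maxMotifs) {
        int n = seq.length();
        int len = motif.length();
        double[][] v = new double[maxMotifs + 1][n + 1];
        boolean[][] viaMotif = new boolean[maxMotifs + 1][n + 1];
        for (double[] row : v) {
            Arrays.fill(row, Double.NEGATIVE_INFINITY);
        }
        v[0][0] = 0.0;
        for (int k = 0; k <= maxMotifs; k++) {
            for (int i = 1; i <= n; i++) {
                // Option 1: emit symbol i-1 from the background state.
                double bg = v[k][i - 1] + LOG_STAY + LOG_BG;
                // Option 2: the k-th motif occurrence ends at symbol i-1.
                double m = Double.NEGATIVE_INFINITY;
                if (k > 0 && i >= len) {
                    m = v[k - 1][i - len] + LOG_ENTER + motifLogProb(seq, i - len, motif);
                }
                if (m > bg) {
                    v[k][i] = m;
                    viaMotif[k][i] = true;
                } else {
                    v[k][i] = bg;
                }
            }
        }
        // Choose the number of motif passes with the highest final score.
        int k = 0;
        for (int c = 1; c <= maxMotifs; c++) {
            if (v[c][n] > v[k][n]) {
                k = c;
            }
        }
        // Trace back to recover where each motif occurrence started.
        List<Integer> starts = new ArrayList<>();
        int i = n;
        while (i > 0) {
            if (viaMotif[k][i]) {
                i -= len;
                k--;
                starts.add(0, i);
            } else {
                i--;
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        String motif = "TGCTGCTGCGGGCCC";
        String seq = "AAAATGCTGCTGCGGGCCCAAAAATGCTGCGGCGGGCCCAAA";
        System.out.println("no loop: " + bestPath(seq, motif, 1));
        System.out.println("looping: " + bestPath(seq, motif, seq.length() / motif.length()));
    }
}
```

With these assumed parameters, the single-pass call reports only the
perfect motif at offset 4, while allowing a second pass also reports the
one-mismatch motif at offset 24.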

The other issue is the type of scoring used. ScoreType.Probability
calculates the Viterbi path based on the transitions of the model and
the emission probabilities of the states. ScoreType.NullModel uses the
'null model', which in your case will be a uniform distribution
(essentially random) and so will be meaningless, hence the strange
result. The null model would be more meaningful if you wanted to model
some biased background. ScoreType.ODDS is the log odds of the trained
model versus the null model. It is most useful when the null model is
not uniform, e.g. where you want to distinguish a signal from a biased
background. It is most often used for proteins, where the background
amino acid distribution is anything but uniform.
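
As a rough illustration of the difference, the plain-Java sketch below
computes a log-odds score as the sum of ln(pModel / pNull) over the
emitted symbols. It is not BioJava's implementation: the class
LogOddsSketch and the three distributions (a GC-rich "signal", a uniform
null, and an AT-rich null) are made up for the example. It shows that
against a uniform DNA null the log odds is just the log probability
shifted by ln 4 per symbol, whereas a biased null changes how much each
symbol contributes.

```java
import java.util.Map;

final class LogOddsSketch {
    // Sum over the sequence of ln( pModel(symbol) / pNull(symbol) ).
    public static double logOdds(String seq, Map<Character, Double> model,
                                 Map<Character, Double> nullModel) {
        double total = 0.0;
        for (char c : seq.toCharArray()) {
            total += Math.log(model.get(c) / nullModel.get(c));
        }
        return total;
    }

    // Plain log-probability of the sequence under a single distribution.
    public static double logProb(String seq, Map<Character, Double> dist) {
        double total = 0.0;
        for (char c : seq.toCharArray()) {
            total += Math.log(dist.get(c));
        }
        return total;
    }

    public static void main(String[] args) {
        // All three distributions are hypothetical values for illustration.
        Map<Character, Double> signal = Map.of('A', 0.05, 'C', 0.45, 'G', 0.45, 'T', 0.05);
        Map<Character, Double> uniform = Map.of('A', 0.25, 'C', 0.25, 'G', 0.25, 'T', 0.25);
        Map<Character, Double> atRich = Map.of('A', 0.40, 'C', 0.10, 'G', 0.10, 'T', 0.40);
        String seq = "GGGCCC";
        System.out.println("log prob            : " + logProb(seq, signal));
        System.out.println("log odds vs uniform : " + logOdds(seq, signal, uniform));
        System.out.println("log odds vs AT-rich : " + logOdds(seq, signal, atRich));
    }
}
```

Against the AT-rich null, the same GC-rich stretch scores higher than
against the uniform null, which is the sense in which a non-uniform
background makes the odds score informative.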

Hope this helps,

- Mark

On 5/22/07, Evert-Jan Blom <e.j.blom at rug.nl> wrote:
> Dear all,
>
> Using a page from the CookBook
> http://www.biojava.org/wiki/BioJava:CookBook:DP:HMM we implemented a
> profile HMM
> in our application to detect regulatory motif instances. To test, we
> created a model based on 10 identical sequences
> (the test sequence was: TGCTGCTGCGGGCCC):
> The model is subsequently trained using a BaumWelchTrainer and decoded
> using the ScoreType.ODDS, ScoreType.Probability and ScoreType.NullModel
>
> The sequence we use for testing contains 2 motifs, a perfect motif and a
> motif with one mismatch:
>
> AAAATGCTGCTGCGGGCCCAAAAATGCTGCGGCGGGCCCAAA
>
> The results of the original HMMER package tell me that there are 2
> instances of the motif present in the test string whereas the biojava
> package yields very strange results:
>
> results using the ScoreType.ODDS, only the second motif is detected:
>
> {AAAATGCTGCTGCGGGCCCAAAAATGCTGCGGCGGGCCCAAA}
> Log Odds = 7.65779871993799
> i-0
> i-0
> i-0
> i-0
> i-0
> i-0
> i-0
> i-0
> i-0
> i-0
> i-0
> i-0
> i-0
> i-0
> i-0
> i-0
> i-0
> i-0
> i-0
> i-0
> i-0
> i-0
> i-0
> i-0
> m-1
> m-2
> m-3
> m-4
> m-5
> m-6
> d-7
> m-8
> m-9
> m-10
> m-11
> m-12
> m-13
> d-14
> d-15
> i-15
> i-15
> i-15
> i-15
> i-15
> i-15
>
> Now the second scorer, only the first motif is detected:
>
> Prob = -95.9806747848816
> i-0
> i-0
> i-0
> i-0
> m-1
> m-2
> m-3
> m-4
> m-5
> m-6
> m-7
> m-8
> m-9
> m-10
> i-10
> i-10
> i-10
> i-10
> i-10
> i-10
> i-10
> i-10
> i-10
> i-10
> i-10
> i-10
> i-10
> i-10
> m-11
> i-11
> m-12
> i-12
> i-12
> i-12
> m-13
> m-14
> m-15
> i-15
> i-15
> i-15
> i-15
> i-15
>
> Now the null model which seems to make no sense at all:
> Null = -94.11166855273558
> m-1
> m-2
> m-3
> m-4
> m-5
> m-6
> m-7
> m-8
> m-9
> m-10
> m-11
> m-12
> m-13
> m-14
> m-15
> i-15
> i-15
> i-15
> i-15
> i-15
> i-15
> i-15
> i-15
> i-15
> i-15
> i-15
> i-15
> i-15
> i-15
> i-15
> i-15
> i-15
> i-15
> i-15
> i-15
> i-15
> i-15
> i-15
> i-15
> i-15
> i-15
> i-15
>
> Is there an option to detect the second motif in the same run, just like
> the original HMMER? Or am I missing some option that is not described
> in the tutorial?
>
> Thanks in advance
>
> E.J.Blom
>


