[Biopython-dev] [GitHub] Bug 2947 viterbi

Sat Jan 29 23:28:21 UTC 2011

Hi all,

I'm forwarding a pull request via github (in future should we
try to have these sent to the dev list automatically?).

Does anyone familiar with HMM's want to look at this?

Thanks,

Peter

---------- Forwarded message ----------
From: GitHub <noreply at github.com>
Date: Sat, Jan 29, 2011 at 10:34 PM
Subject: [GitHub] Bug 2947 viterbi [biopython/biopython GH-1]
To: p.j.a.cock at googlemail.com

pgarland wants someone to pull from pgarland:bug-2947-viterbi:

Hello,

I think this fixes bug 2947. There were 2 errors in how the state
sequence was calcuated.

The first occurs at the beginning of the sequence.
viterbi_probs[(state_letters[0], -1)] was initialized to 1 and
viterbi_probs[(state_letters[0], -1)]  to 0 for all state letters
other than the zeroth. This is how it is described in Biological
Sequence Analysis by Durbin, et al, but it doesn't work for the code
as written because the code doesn't provide for an particular begin
state. By initializing the zeroth state letter to 1, and all the
others to 0, you're starting off by assigning a higher probability to
a state sequence that begins with the zeroth state letter.

I fixed this error by also setting  viterbi_probs[(state_letters[0],
-1)] to 0, so that all possible initial states are equally probable.

There's a second error in the Viterbi termination code. The algorithm
described in Durbin et al allows for modeling a particular end state.
Since the code as written doesn't provide for specifying an end state,
the termination code miscalculates the sequence probability. The
pseudocode in Durbin et al confusingly labels the end state as "0", at
least in the printing I have, and this seems to have been carried over
into the biopython code, where the zeroth state_letter is whichever
one is first in the list. The code as written calculated the
probability of the discovered state sequence multiplied by the
probability of transitioning from the sequence's last element to the
zeroth state named in state_letters, when it should just calculate the
probability of the discovered state sequence.

I fixed this by deleting the lines that were intended to account for
the transition to an end state.

It could be useful to specify particular begin and end states, but I
believe this patch should give correct results for all cases that
don't need that ability.

~Phillip

View Pull Request: https://github.com/biopython/biopython/pull/1