[Biopython] Python equivalent of the Perl String::Approx module for approximate matching?
Tal Einat
taleinat at gmail.com
Wed Mar 12 18:00:16 UTC 2014
> Kevin wrote:
>
> I am looking for a function which, given sequence1 and sequence2, would
> return whether sequence1 matches a subsequence of sequence2 allowing up to
> I insertions, D deletions, and S substitutions.
Kevin, if you don't want to allow deletions at all (I got the
impression this is what you're looking for) then the following code
will do the trick (and should be quite fast).
Is this generally useful, or would it usually be more useful to also
allow a (possibly limited) number of deletions?
from array import array
def kevin(str1, str2, max_substitutions, max_insertions):
"""check if it is possible to transform str1 into str2 given limitations
The limiations are the maximum allowed number of new characters inserted
and the maximum allowed number of character substitutions.
"""
# check simple cases which are obviously impossible
if not len(str1) <= len(str2) <= len(str1) + max_insertions:
return False
scores = array('L', [0] * (len(str2) - len(str1) + 1))
new_scores = scores[:]
for (str1_idx, char1) in enumerate(str1):
# make min() always take the other value in the first iteration of the
# inner loop
prev_score = len(str2)
for (n_insertions, char2) in enumerate(
str2[str1_idx:len(str2)-len(str1)+str1_idx+1]
):
new_scores[n_insertions] = prev_score = min(
scores[n_insertions] + (0 if char1 == char2 else 1),
prev_score
)
# swap scores <-> new_scores
scores, new_scores = new_scores, scores
return min(scores) <= max_substitutions
- Tal
More information about the Biopython
mailing list