[Biopython] Python equivalent of the Perl String::Approx module for approximate matching?
    Tal Einat 
    taleinat at gmail.com
       
    Wed Mar 12 18:00:16 UTC 2014
    
    
  
> Kevin wrote:
>
> I am looking for a function which, given sequence1 and sequence2, would
> return whether sequence1 matches a subsequence of sequence2 allowing up to
> I insertions, D deletions, and S substitutions.
Kevin, if you don't want to allow deletions at all (I got the
impression this is what you're looking for) then the following code
will do the trick (and should be quite fast).
Is this generally useful, or would it usually be more useful to also
allow a (possibly limited) number of deletions?
from array import array
def kevin(str1, str2, max_substitutions, max_insertions):
    """check if it is possible to transform str1 into str2 given limitations
    The limiations are the maximum allowed number of new characters inserted
    and the maximum allowed number of character substitutions.
    """
    # check simple cases which are obviously impossible
    if not len(str1) <= len(str2) <= len(str1) + max_insertions:
        return False
    scores = array('L', [0] * (len(str2) - len(str1) + 1))
    new_scores = scores[:]
    for (str1_idx, char1) in enumerate(str1):
        # make min() always take the other value in the first iteration of the
        # inner loop
        prev_score = len(str2)
        for (n_insertions, char2) in enumerate(
                str2[str1_idx:len(str2)-len(str1)+str1_idx+1]
        ):
            new_scores[n_insertions] = prev_score = min(
                scores[n_insertions] + (0 if char1 == char2 else 1),
                prev_score
            )
        # swap scores <-> new_scores
        scores, new_scores = new_scores, scores
    return min(scores) <= max_substitutions
- Tal
    
    
More information about the Biopython
mailing list