[Biopython] Python equivalent of the Perl String::Approx module for approximate matching?

Tal Einat taleinat at gmail.com
Sat Mar 15 18:59:07 UTC 2014


On Thu, Mar 13, 2014 at 8:08 PM, Kevin Rue <kevin.rue at ucdconnect.ie> wrote:
> Hi Tal,
>
> I finally had the time to look at your function in details, and understood
> how it works and why each line is there. Thanks, even though it doesn't do
> exactly what I had in mind, it does what the docstring says :)
> I'd call your "kevin" function "approxEqual(str1, str2, max_ins, max_sub)".
>
>
> When I understood your code, I thought of a way to increase the speed of
> your function, but ended with a (fast) function actually doing something
> slightly different. This one would actually only look at the str2 substrings
> from str1_idx but which are no longer than len(str1)+max_insertions. Longer
> str2 are pointless as they imply that str1 requires more insertions at the
> start than allowed to match str2.
> I would call that function "approxStartsWith(str1, str2, max_ins, max_sub)"
>
> def approxStartsWith(str1, str2, max_substitutions, max_insertions):
>     """check if it is possible to map str1 to the start of str2 given
> limitations
>
>     The limitations are the maximum allowed number of new characters
> inserted
>     and the maximum allowed number of character substitutions.
>     """
>     # check simple cases which are obviously impossible
>     if not len(str1) <= len(str2):
>         return False
>
>     scores = array('L', [0] * (max_insertions + 1))
>     new_scores = scores[:]
>
>     for (str1_idx, char1) in enumerate(str1):
>         # make min() always take the other value in the first iteration of
> the
>         # inner loop
>         prev_score = len(str2)
>         for (n_insertions, char2) in enumerate(
>                 str2[str1_idx:max_insertions+str1_idx+1]
>         ):
>             new_scores[n_insertions] = prev_score = min(scores[n_insertions]
> + (0 if char1 == char2 else 1), prev_score )
>
>         # swap scores <-> new_scores
>         scores, new_scores = new_scores, scores
>
>     return min(scores) <= max_substitutions
>
>
> Now, an approxWithin(str1, str2, max_ins, max_sub) could be simply
> implemented as:
>
> def approxWithin(str1, str2, max_ins, max_sub):
> """check if it is possible to find str1 within str2 given limitations
>
> The limitations are the maximum allowed number of new characters inserted
> and the maximum allowed number of character substitutions.
> """
> for str2_idx in range(len(str2)-len(str1)+1):
> print (str2_idx)
> result = approxStartsWith(str1, str2[str2_idx:], max_ins, max_sub)
> if result:
> return result
> return False
>
> Comments?

Aha! So you are, in fact, searching for a short sequence in a long
sequence (or many such long sequences). This -is- exactly what
fuzzysearch is meant for!


Regarding the code you posted, it looks like it would work (though I
haven't tested it). It certainly isn't very efficient, however, since
it makes many copies of long parts of str2 (see "str2[str2_idx:]"). It
is also fairly straightforward, leaving some room for further
optimization. I do like that it is very readable and easy to
understand! Well, except for the loop in approxStartsWith(), but
that's based on my own code...


Inspired by your use-case, I've added highly generic fuzzy searching
functionality to fuzzysearch. You can now limit the number of
substitutions, insertions and deletions as well as their total (i.e.
limit the Levenshtein distance). You can also limit only some of these
as you like.

Specifically, this supports your use-case of searching for fuzzy
matches allowing only a limited number of substitutions and
insertions, but no deletions.

The user-friendly utility function fuzzysearch.find_near_matches() now
accepts parameters for limiting the substitutions etc., and chooses a
suitable implementation based on the given parameters.

I haven't yet implemented an optimized search function allowing only
substitutions and insertions. If the current version not fast enough
for your needs, there are plenty of optimizations still to be done.

I'd be happy if you could give it a whirl and tell me what happens! I
haven't released it yet, but you can install the latest development
version using pip (you'll need to have git installed for this):

pip install --upgrade
git+git://github.com/taleinat/fuzzysearch.git#egg=fuzzysearch

- Tal



More information about the Biopython mailing list