[Biopython] Get all alignments of a sequence against another
Tal Einat
taleinat at gmail.com
Fri Mar 14 18:01:51 UTC 2014
On Fri, Mar 14, 2014 at 5:52 PM, Kevin Rue <kevin.rue at ucdconnect.ie> wrote:
> Hi Mary, Tal,
>
> In that case you describe, the solution to your problem is rather
> straightforward to implement.. I split it in two functions pasted below.
Kevin, that's a nice solution!
Here's a somewhat more efficient solution, based on the same basic
principals but implemented with some optimizations and non-trivial
index juggling. This will be included in future versions of
fuzzysearch. Raw code is below; for the next month or so a highlighted
version can be found at: dpaste.com/1728155/
I still feel we're reinventing the wheel here. Surely it is possible
to do this with BioPython. Unfortunately, I too couldn't easily figure
out how to do so from reading the documentation and a bit of trial and
error.
- Tal Einat
from collections import deque, defaultdict, namedtuple
from itertools import islice
Match = namedtuple('Match', ['start', 'end', 'dist'])
def find_near_matches_only_substitutions(subsequence, sequence,
max_substitutions):
"""search for near-matches of subsequence in sequence
This searches for near-matches, where the nearly-matching parts of the
sequence must meet the following limitations (relative to the subsequence):
* the number of character substitutions must be less than max_substitutions
* no deletions or insertions are allowed
"""
if not subsequence:
raise ValueError('Given subsequence is empty!')
# simple optimization: prepare some often used things in advance
_SUBSEQ_LEN = len(subsequence)
_SUBSEQ_LEN_MINUS_ONE = _SUBSEQ_LEN - 1
# prepare quick lookup of where a character appears in the subsequence
char_indexes_in_subsequence = defaultdict(list)
for (index, char) in enumerate(subsequence):
char_indexes_in_subsequence[char].append(index)
# we'll iterate over the sequence once, but the iteration is split into two
# for loops; therefore we prepare an iterator in advance which will be used
# in for of the loops
sequence_enum_iter = enumerate(sequence)
# We'll count the number of matching characters assuming various attempted
# alignments of the subsequence to the sequence. At any point in the
# sequence there will be N such alignments to update. We'll keep
# these in a "circular array" (a.k.a. a ring) which we'll rotate after each
# iteration to re-align the indexing.
# Initialize the candidate counts by iterating over the first N-1 items in
# the sequence. No possible matches in this step!
candidates = deque([0], maxlen=_SUBSEQ_LEN)
for (index, char) in islice(sequence_enum_iter, _SUBSEQ_LEN_MINUS_ONE):
for subseq_index in [idx for idx in
char_indexes_in_subsequence[char] if idx <= index]:
candidates[subseq_index] += 1
candidates.appendleft(0)
matches = []
# From the N-th item onwards, we'll update the candidate counts exactly as
# above, and additionally check if the part of the sequence whic began N-1
# items before the current index was a near enough match to the given
# sub-sequence.
for (index, char) in sequence_enum_iter:
for subseq_index in char_indexes_in_subsequence[char]:
candidates[subseq_index] += 1
# rotate the ring of candidate counts
candidates.rotate(1)
# fetch the count for the candidate which started N-1 items ago
n_substitutions = _SUBSEQ_LEN - candidates[0]
# set the count for the next index to zero
candidates[0] = 0
# if the candidate had few enough mismatches, yield a match
if n_substitutions <= max_substitutions:
matches.append(Match(
start=index - _SUBSEQ_LEN_MINUS_ONE,
end=index + 1,
dist=n_substitutions,
))
return matches
More information about the Biopython
mailing list