[Biopython] Get all alignments of a sequence against another

Kevin Rue kevin.rue at ucdconnect.ie
Fri Mar 14 09:16:45 UTC 2014


Hi Mary,

One part of your question is unclear: how exactly do you define "a
location where your small_sequence aligns"?
From your example, it seems you are not looking for exact matches only, since
you allow one mismatch in this case. Is that a maximum number of mismatches?
Do you also want to allow indels? Do you want to control the numbers of
insertions, deletions and substitutions separately? Or is a match a local
alignment above a score threshold?
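
If substitutions alone are enough (a maximum number of mismatches, no
indels), a plain sliding-window scan may already do the job. The sketch
below is only an illustration: the name find_with_mismatches and its
parameters are hypothetical, not part of Biopython or fuzzysearch, and the
test sequence is rebuilt from assumed filler lengths modelled on your example.

```python
def find_with_mismatches(small, large, max_mismatches=1):
    """Return (start, end, mismatches) for every window of `large` that
    matches `small` with at most `max_mismatches` substitutions (no indels)."""
    hits = []
    width = len(small)
    for start in range(len(large) - width + 1):
        window = large[start:start + width]
        # Count the positions at which the two strings differ.
        mismatches = sum(a != b for a, b in zip(small, window))
        if mismatches <= max_mismatches:
            hits.append((start, start + width, mismatches))
    return hits


# Assumed test sequence: filler lengths are illustrative, not copied verbatim.
large = "X" * 19 + "GGGTTVTTSS" + "A" * 35 + "B" * 25 + "GGGTTLTTSS"
small = "GGGTTVTTSS"

for hit in find_with_mismatches(small, large, max_mismatches=1):
    print(hit)
# (19, 29, 0)
# (89, 99, 1)
```

The scan costs len(large) * len(small) comparisons, which is fine for short
sequences but would be slow on genome-sized input.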

I would suggest that you have a look at the definition of the Levenshtein
distance (see the example at
http://en.wikipedia.org/wiki/Levenshtein_distance#Example).
If this metric suits you, for instance to find all matches of the
small_sequence in the large_sequence with a maximal edit distance of 1,
you can use one of the Python packages that search with the Levenshtein
distance, such as "fuzzysearch" (https://pypi.python.org/pypi/fuzzysearch/0.2.0),
this way:

>>> import fuzzysearch
>>> fuzzysearch.find_near_matches_with_ngrams("GGGTTLTTSS", "XXXXXXXXXXXXXXXXXXXGGGTTVTTSSAAAAAAAAAAAAAGGGTTLTTSSAAAAAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBBBBBBBBBGGGTTLTTSS", 1)

This call reports two matches:
Out[7]: [Match(start=89, end=99, dist=0), Match(start=89, end=99, dist=0)]

BUG:
Note that the second match is reported twice. I assume this is a bug in
which the first match was somehow replaced by the second, which is why I
have copied Tal (the developer of this package) on this email.

Here is another example, in which I added your sequence (with a mismatch) a third time:

>>> fuzzysearch.find_near_matches_with_ngrams("GGGTTLTTSS", "XXXXXXXXXXXXXXXXXXXGGGTTVTTSSAAAAAAAAAAAAAGGGTTVTTSSAAAAAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBBBBBBBBBGGGTTLTTSS", 1)

returns
Out[9]:
[Match(start=42, end=52, dist=1),
 Match(start=99, end=109, dist=0),
 Match(start=99, end=109, dist=0)]

You can see three matches. One of the mismatched sequences was detected
correctly (edit distance of 1), but the bug seems to duplicate the last
match and to overwrite the second-to-last match with that copy.
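
To check a reported distance independently of the package, the Levenshtein
distance between two short strings can be computed with the textbook
dynamic-programming recurrence. This is only a minimal sketch: the function
name levenshtein is my own choice, and it is not how fuzzysearch works
internally.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions
    needed to turn string `a` into string `b`."""
    # previous[j] holds the distance between the prefix of `a` handled
    # so far and the first j characters of `b`.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


print(levenshtein("GGGTTLTTSS", "GGGTTVTTSS"))  # 1 (one substitution)
print(levenshtein("GGGTTLTTSS", "GGGTTTTSS"))   # 1 (one deletion)
```

Unlike a mismatch count, this metric also tolerates indels, so it answers a
slightly different question from a substitution-only scan.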

Tal, can you fix that? I will add the issue to your repository :)

Cheers
Kevin




On 13 March 2014 19:57, Mary Kindall <mary.kindall at gmail.com> wrote:

> This is a primitive question but somehow I could not find a solution to it.
> I have two sequences 'large' and 'small' as given below.
>
> >large
>
> XXXXXXXXXXXXXXXXXXXGGGTTVTTSSAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBBBBBBBBBGGGTTLTTSS
>
>
> >small
> GGGTTVTTSS
>
>
> I need to align the 'small' sequence to the 'large' sequence. Clearly there
> are two places where it can be aligned. I need to get indices of both the
> locations. I was trying BioPython's "pairwise2.align.globalms" function but
> it is only able to align to the second position.
>
>
>
> pairwise2.align.globalms(largeStr, smallStr, 2, -1, -1, 0,
> penalize_end_gaps=False)
> Ans:
>
> [('XXXXXXXXXXXXXXXXXXXGGGTTVTTSSAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBBBBBBBBBGGGTTLTTSS',
>
> '-----------------------------------------------------------------------------------------GGGTTLTTSS',
> 20.0,
> 0,
> 99)]
>
>
>
> Which parameter can I change here, or which other lightweight free
> package or software can compute this?
>
> --
> Mary



-- 
Kévin RUE-ALBRECHT
Wellcome Trust Computational Infection Biology PhD Programme
University College Dublin
Ireland
http://fr.linkedin.com/pub/k%C3%A9vin-rue/28/a45/149/en



