[Biopython] Comparing two already aligned sequences quickly to count gaps/matches/mismatches

Thu Feb 6 07:56:36 EST 2025

Sadly I don't think that answers my query.

I don't actually need the number of gap-pairs (which can be inferred from
the other numbers anyway as they sum to the number of columns in the MSA).
So that's fine.

However, I am specifically needing the number of matches and mismatches.
That means I can't use the faster mode you're suggesting (which ignores the
non-gap characters in the sequences) :(

Moreover, I don't want the global counts, I want the pairwise counts. I
rather suspect based on the Biopython 1.85 code that I do just have to do
the inner loop over pairs of bases:

https://github.com/biopython/biopython/blob/biopython-185/Bio/Align/__init__.py#L3576

Thanks though - it looks like I wasn't overlooking something "off the
shelf" here.

Peter

On Thu, Feb 6, 2025 at 12:29 PM Michiel de Hoon <mjldehoon at yahoo.com> wrote:

> >
> This is with a FASTA input MSA (all sequences the same length with gap
> characters, and some pairs will have common gaps).
>
> Then you can use the `"fasta"` parser in `Bio.Align`, which parses the
> aligned sequences in C, so it should be fast.
> Then call the new `.counts`  method in Biopython 1.86dev with
> ignore_sequences=True to calculate the number of insertions and deletions
> quickly.
> This won't give you the number of gap-against-gap alignments, but those
> are not meaningful anyway.
>
> -Michiel
>
> On Thursday, February 6, 2025 at 07:51:48 PM GMT+9, Peter Cock <
> p.j.a.cock at googlemail.com> wrote:
>
>
> This is with a FASTA input MSA (all sequences the same length with gap
> characters, and some pairs will have common gaps).
>
> I have been loading this incrementally so only two sequences were in RAM
> at any point - and then switched to multiple threads (so at any point with
> K threads only 2*K sequences were loaded).
>
> The test case is only about 100MB on disk, 100 sequences each of 1 million
> base pairs, and takes about a minute or two (multi-threaded laptop).
>
> I don't think this will be memory constrained, so working on the entire
> MSA with a single thread is fine (if faster).
>
> Thank you,
>
> Peter
>
> On Thu, Feb 6, 2025 at 10:36 AM Michiel de Hoon <mjldehoon at yahoo.com>
> wrote:
>
> > I have lots of pairs of pre-aligned sequences (imported from an
> external MSA file),
>
> In which format is your MSA file?
>
> -Michiel
>
> On Thursday, January 30, 2025 at 11:59:33 PM GMT+9, Peter Cock <
> p.j.a.cock at googlemail.com> wrote:
>
>
> Hello all, and Michiel in particular,
>
> I am wondering if any of the pairwise alignment code in Bio.Align (much of
> which is written in C for speed) could help with this use case?:
>
> I have lots of pairs of pre-aligned sequences (imported from an external
> MSA file), for which I am doing something like this:
>
> ```python
> def count_matches_etc(query_seq, subject_seq):
>     assert len(query_seq) == len(subject_seq), "Should be same length"
>     matches = non_gap_mismatches = either_gapped = both_gapped = 0
>     for q, s in zip(query_seq, subject_seq, strict=True):
>         if q == "-" and s == "-":
>             both_gapped += 1
>         elif q == "-" or s == "-":
>             either_gapped += 1
>         elif q == s:
>             matches += 1
>         else:
>             non_gap_mismatches += 1
>     assert matches + non_gap_mismatches + either_gapped + both_gapped ==
> len(query_seq)
>     return matches, non_gap_mismatches, either_gapped, both_gapped
>
>
> # Test case
> assert (9, 1, 2, 1) == count_matches_etc("ACGTAC-TAC-GT", "AGGT-CGTAC-GT")
> ```
>
> Sticking with Python that could be optimized (e.g. I am currently using
> this with sequences of a million base pairs but few gaps), however I have
> written this example with clarity foremost in mind.
>
> Thank you,
>
> Peter
> _______________________________________________
> Biopython mailing list  -  Biopython at biopython.org
> https://mailman.open-bio.org/mailman/listinfo/biopython
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mailman.open-bio.org/pipermail/biopython/attachments/20250206/0d8c0eeb/attachment.htm>