[Biopython] Comparing two already aligned sequences quickly to count gaps/matches/mismatches
Michiel de Hoon
mjldehoon at yahoo.com
Thu Feb 6 07:29:10 EST 2025
> This is with a FASTA input MSA (all sequences the same length with gap characters, and some pairs will have common gaps).
Then you can use the `"fasta"` parser in `Bio.Align`, which parses the aligned sequences in C, so it should be fast.Then call the new `.counts` method in Biopython 1.86dev with ignore_sequences=True to calculate the number of insertions and deletions quickly.This won't give you the number of gap-against-gap alignments, but those are not meaningful anyway.
-Michiel
On Thursday, February 6, 2025 at 07:51:48 PM GMT+9, Peter Cock <p.j.a.cock at googlemail.com> wrote:
This is with a FASTA input MSA (all sequences the same length with gap characters, and some pairs will have common gaps).
I have been loading this incrementally so only two sequences were in RAM at any point - and then switched to multiple threads (so at any point with K threads only 2*K sequences were loaded).
The test case is only about 100MB on disk, 100 sequences each of 1 million base pairs, and takes about a minute or two (multi-threaded laptop).
I don't think this will be memory constrained, so working on the entire MSA with a single thread is fine (if faster).
Thank you,
Peter
On Thu, Feb 6, 2025 at 10:36 AM Michiel de Hoon <mjldehoon at yahoo.com> wrote:
> I have lots of pairs of pre-aligned sequences (imported from an external MSA file),
In which format is your MSA file?
-Michiel
On Thursday, January 30, 2025 at 11:59:33 PM GMT+9, Peter Cock <p.j.a.cock at googlemail.com> wrote:
Hello all, and Michiel in particular,
I am wondering if any of the pairwise alignment code in Bio.Align (much of which is written in C for speed) could help with this use case?:
I have lots of pairs of pre-aligned sequences (imported from an external MSA file), for which I am doing something like this:
```python
def count_matches_etc(query_seq, subject_seq):
assert len(query_seq) == len(subject_seq), "Should be same length"
matches = non_gap_mismatches = either_gapped = both_gapped = 0
for q, s in zip(query_seq, subject_seq, strict=True):
if q == "-" and s == "-":
both_gapped += 1
elif q == "-" or s == "-":
either_gapped += 1
elif q == s:
matches += 1
else:
non_gap_mismatches += 1
assert matches + non_gap_mismatches + either_gapped + both_gapped == len(query_seq)
return matches, non_gap_mismatches, either_gapped, both_gapped
# Test case
assert (9, 1, 2, 1) == count_matches_etc("ACGTAC-TAC-GT", "AGGT-CGTAC-GT")
```
Sticking with Python that could be optimized (e.g. I am currently using this with sequences of a million base pairs but few gaps), however I have written this example with clarity foremost in mind.
Thank you,
Peter_______________________________________________
Biopython mailing list - Biopython at biopython.org
https://mailman.open-bio.org/mailman/listinfo/biopython
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mailman.open-bio.org/pipermail/biopython/attachments/20250206/c071daea/attachment-0001.htm>
More information about the Biopython
mailing list