<div dir="ltr"><div>Sadly I don't think that answers my query.</div><div><br></div><div>I don't actually need the number of gap-pairs (which can be inferred from the other numbers anyway as they sum to the number of columns in the MSA). So that's fine.</div><div><br></div><div>However, I am specifically needing the number of matches and mismatches. That means I can't use the faster mode you're suggesting (which ignores the non-gap characters in the sequences) :(</div><div><br></div><div>Moreover, I don't want the global counts, I want the pairwise counts. I rather suspect based on the Biopython 1.85 code that I do just have to do the inner loop over pairs of bases:</div><div><br></div><div><a href="https://github.com/biopython/biopython/blob/biopython-185/Bio/Align/__init__.py#L3576">https://github.com/biopython/biopython/blob/biopython-185/Bio/Align/__init__.py#L3576</a></div><div><br></div><div>Thanks though - it looks like I wasn't overlooking something "off the shelf" here.</div><div><br></div><div>Peter</div></div><br><div class="gmail_quote gmail_quote_container"><div dir="ltr" class="gmail_attr">On Thu, Feb 6, 2025 at 12:29 PM Michiel de Hoon <<a href="mailto:mjldehoon@yahoo.com">mjldehoon@yahoo.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div><div style="font-family:Helvetica Neue,Helvetica,Arial,sans-serif;font-size:10px"><div></div>

        <div dir="ltr">> <span><div>This is with a FASTA input MSA (all sequences the same length with gap characters, and some pairs will have common gaps).</div></span></div><div dir="ltr"><br></div><div dir="ltr">Then you can use the `"fasta"` parser in `Bio.Align`, which parses the aligned sequences in C, so it should be fast.</div><div dir="ltr">Then call the new `.counts`  method in Biopython 1.86dev with ignore_sequences=True to calculate the number of insertions and deletions quickly.</div><div dir="ltr">This won't give you the number of gap-against-gap alignments, but those are not meaningful anyway.</div><div dir="ltr"><br></div><div dir="ltr">-Michiel<br></div><div><br></div>

        

        </div><div id="m_7082227319688521718yahoo_quoted_8899767467">

            <div style="font-family:"Helvetica Neue",Helvetica,Arial,sans-serif;font-size:13px;color:rgb(38,40,42)">

                

                <div>

                        On Thursday, February 6, 2025 at 07:51:48 PM GMT+9, Peter Cock <<a href="mailto:p.j.a.cock@googlemail.com" target="_blank">p.j.a.cock@googlemail.com</a>> wrote:

                    </div>

                    <div><br></div>

                    <div><br></div>

                

                

                <div><div id="m_7082227319688521718yiv7863050864"><div><div dir="ltr"><div>This is with a FASTA input MSA (all sequences the same length with gap characters, and some pairs will have common gaps).</div><div><br clear="none"></div><div>I have been loading this incrementally so only two sequences were in RAM at any point - and then switched to multiple threads (so at any point with K threads only 2*K sequences were loaded).</div><div><br clear="none"></div><div>The test case is only about 100MB on disk, 100 sequences each of 1 million base pairs, and takes about a minute or two (multi-threaded laptop).<br clear="none"></div><div><br clear="none"></div><div><div>I don't think this will be memory constrained, so working on the entire MSA with a single thread is fine (if faster).<br clear="none"></div><div><br clear="none"></div><div>Thank you,</div><div><br clear="none"></div><div>Peter<br clear="none"></div></div></div><br clear="none"><div id="m_7082227319688521718yiv7863050864yqt97629"><div><div dir="ltr">On Thu, Feb 6, 2025 at 10:36 AM Michiel de Hoon <<a rel="nofollow noopener noreferrer" shape="rect" href="mailto:mjldehoon@yahoo.com" target="_blank">mjldehoon@yahoo.com</a>> wrote:<br clear="none"></div><blockquote style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div><div style="font-family:Helvetica Neue,Helvetica,Arial,sans-serif;font-size:10px"><div></div>

        <div dir="ltr">> <span>I have lots of pairs of pre-aligned sequences (imported from an external MSA file),</span></div><div dir="ltr"><span><br clear="none"></span></div><div dir="ltr"><span>In which format is your MSA file?</span></div><div dir="ltr"><span><br clear="none"></span></div><div dir="ltr"><span>-Michiel<br clear="none"></span></div><div><br clear="none"></div>

        

        </div><div id="m_7082227319688521718yiv7863050864m_8475683859613579565yahoo_quoted_9753999089">

            <div style="font-family:Helvetica,Arial,sans-serif;font-size:13px;color:rgb(38,40,42)">

                

                <div>

                        On Thursday, January 30, 2025 at 11:59:33 PM GMT+9, Peter Cock <<a rel="nofollow noopener noreferrer" shape="rect" href="mailto:p.j.a.cock@googlemail.com" target="_blank">p.j.a.cock@googlemail.com</a>> wrote:

                    </div>

                    <div><br clear="none"></div>

                    <div><br clear="none"></div>

                

                

                <div><div id="m_7082227319688521718yiv7863050864m_8475683859613579565yiv8603226725"><div dir="ltr">Hello all, and Michiel in particular,<br clear="none"><br clear="none">I am wondering if any of the pairwise alignment code in Bio.Align (much of which is written in C for speed) could help with this use case?:<br clear="none"><br clear="none">I have lots of pairs of pre-aligned sequences (imported from an external MSA file), for which I am doing something like this:<br clear="none"><br clear="none">```python<br clear="none">def count_matches_etc(query_seq, subject_seq):<br clear="none">    assert len(query_seq) == len(subject_seq), "Should be same length"<br clear="none">    matches = non_gap_mismatches = either_gapped = both_gapped = 0<br clear="none">    for q, s in zip(query_seq, subject_seq, strict=True):<br clear="none">        if q == "-" and s == "-":<br clear="none">            both_gapped += 1<br clear="none">        elif q == "-" or s == "-":<br clear="none">            either_gapped += 1<br clear="none">        elif q == s:<br clear="none">            matches += 1<br clear="none">        else:<br clear="none">            non_gap_mismatches += 1<br clear="none">    assert matches + non_gap_mismatches + either_gapped + both_gapped == len(query_seq)<br clear="none">    return matches, non_gap_mismatches, either_gapped, both_gapped<br clear="none"><br clear="none"><br clear="none"># Test case<br clear="none">assert (9, 1, 2, 1) == count_matches_etc("ACGTAC-TAC-GT", "AGGT-CGTAC-GT")<br clear="none">```<br clear="none"><br clear="none">Sticking with Python that could be optimized (e.g. I am currently using this with sequences of a million base pairs but few gaps), however I have written this example with clarity foremost in mind.<br clear="none"><div><br clear="none"></div><div>Thank you,</div><div><br clear="none"></div>Peter</div>

</div>_______________________________________________<br clear="none">Biopython mailing list  -  <a rel="nofollow noopener noreferrer" shape="rect" href="mailto:Biopython@biopython.org" target="_blank">Biopython@biopython.org</a><br clear="none"><a rel="nofollow noopener noreferrer" shape="rect" href="https://mailman.open-bio.org/mailman/listinfo/biopython" target="_blank">https://mailman.open-bio.org/mailman/listinfo/biopython</a><br clear="none"></div>

            </div>

        </div></div></blockquote></div></div>

</div></div></div>

            </div>

        </div></div></blockquote></div>