<html><head></head><body><div class="ydpa682210eyahoo-style-wrap" style="font-family:Helvetica Neue, Helvetica, Arial, sans-serif;font-size:10px;"><div></div>
<div dir="ltr" data-setdir="false">Hi all,</div><div dir="ltr" data-setdir="false"><br></div><div dir="ltr" data-setdir="false">You may also want to try reading the FASTA/FASTQ file in larger chunks, rather than line by line. I have tried this in the past, and it gave me much better performance.</div><div dir="ltr" data-setdir="false"><br></div><div dir="ltr" data-setdir="false">This may be a good opportunity to try a SeqRecord-free parser, i.e. a parser that returns Seq objects instead of SeqRecord objects. I never quite understood why we have SeqRecords alongside Seq objects.</div><div dir="ltr" data-setdir="false"><br></div><div dir="ltr" data-setdir="false">Best,</div><div dir="ltr" data-setdir="false">-Michiel</div><div><br></div>
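<div dir="ltr" data-setdir="false">P.S. To make the chunked-reading suggestion concrete, here is a minimal sketch. It is not Biopython code: the function name iter_fastq_chunked and the 1 MiB default chunk size are assumptions for illustration only. It reads from a binary handle (as Peter suggests in the quoted message), assumes four lines per record with Unix line endings, and yields plain bytes rather than Seq or SeqRecord objects.</div>
<pre>
```python
import io


def iter_fastq_chunked(handle, chunk_size=1024 * 1024):
    """Yield (title, sequence, quality) bytes tuples from a binary FASTQ handle.

    Hypothetical helper, not part of Biopython. Assumes four lines per
    record (no line wrapping) and Unix line endings.
    """
    leftover = b""
    while True:
        chunk = handle.read(chunk_size)
        at_eof = not chunk
        lines = (leftover + chunk).split(b"\n")
        if at_eof:
            # A final newline leaves one empty trailing element; drop it.
            if lines[-1] == b"":
                lines.pop()
            leftover = b""
        else:
            # The last element may be a partial line; keep it for later.
            leftover = lines.pop()
        # Only emit whole records; carry any partial record forward.
        usable = len(lines) - len(lines) % 4
        for i in range(0, usable, 4):
            title, seq, plus, qual = lines[i:i + 4]
            if title[:1] != b"@" or plus[:1] != b"+" or len(seq) != len(qual):
                raise ValueError("Malformed FASTQ record near %r" % title)
            yield title[1:], seq, qual
        tail = lines[usable:]
        if at_eof:
            if tail:
                raise ValueError("Truncated FASTQ record at end of file")
            return
        if tail:
            # Put the incomplete record back in front of the partial line.
            leftover = b"\n".join(tail) + b"\n" + leftover
```
</pre>
<div dir="ltr" data-setdir="false">In use, the handle would come from open(path, "rb"), and the caller would decode a field to str only when it is actually needed. I have not benchmarked this particular sketch, so treat it as a starting point for an experiment rather than a measured result.</div>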
</div><div id="yahoo_quoted_5210870872" class="yahoo_quoted">
<div style="font-family:'Helvetica Neue', Helvetica, Arial, sans-serif;font-size:13px;color:#26282a;">
<div>
On Wednesday, November 26, 2025 at 06:38:52 PM GMT+9, Peter Cock <p.j.a.cock@googlemail.com> wrote:
</div>
<div><br></div>
<div><br></div>
<div><div dir="ltr">Hi Terry,<br clear="none"><br clear="none">That's a very even-handed reply. I hadn't considered speaking to AI<br clear="none">tools with voice recognition, but Amazon Alexa isn't great and has<br clear="none">probably lowered my expectations - grin.<br clear="none"><br clear="none">The SeqRecord overhead is big, especially making the ASCII quality<br clear="none">string into a list of integers. You often don't need this, and can<br clear="none">stick with strings. See e.g.:<br clear="none"><br clear="none"><a shape="rect" href="https://www.open-bio.org/2009/09/25/biopython-fast-fastq/" target="_blank">https://www.open-bio.org/2009/09/25/biopython-fast-fastq/</a><br clear="none"><br clear="none">In the short term, a much easier experiment to try is switching our<br clear="none">pure Python FASTQ parser to work in bytes mode (binary handles) to<br clear="none">avoid the bytes to Unicode conversion of the sequence and quality<br clear="none">strings.<br clear="none"><br clear="none">Peter<br clear="none"><div class="yqt1432880540" id="yqtfd65057"><br clear="none">On Tue, Nov 25, 2025 at 7:56 PM Jones Kelly, Terence Carleton<br clear="none"><<a shape="rect" ymailto="mailto:terence.jones@charite.de" href="mailto:terence.jones@charite.de">terence.jones@charite.de</a>> wrote:<br clear="none">><br clear="none">> Hi all<br clear="none">><br clear="none">> Thanks for the replies and I'm sorry to be so slow in replying.<br clear="none">><br clear="none">> Peter - I’m sceptical, too. I thought it highly unlikely that BioPython would want to add a dependency on Rust. The C code could be used to get the same speed-up. The automatic detection of compressed input (including on stdin) would be lost, but perhaps could be added. Dan - maybe Claude could make the changes for a PR. I wouldn’t be surprised, even with such a large codebase.<br clear="none">><br clear="none">> Regarding the wider points made by Peter in his blog post, I agree with probably all of it. 
I highly doubt that the current trajectory (in terms of energy use) of these tools is sustainable, and I regularly wonder about the actual cost (to the planet) I am incurring when using Claude, especially when it’s doing something I could do in emacs or with e.g., a perl one-liner. But I get stuff done about 20x as fast when using Claude, and I just speak to it (using Dictation on OS X), barely typing a thing. That’s all an incredible change after 40+ years of having to get thoughts out of my brain and into the computer one keystroke at a time. It feels like magic. BUT, I have to watch Claude very carefully, look at its code, make it write many tests and explicitly prove that it has managed to do what was intended. That’s not so different from watching yourself or doing code review of a junior (very energetic, ever-cheerful) programmer.<br clear="none">><br clear="none">> I agree that an AI (Claude, certainly) could be used to very quickly make general improvements to a codebase with very little risk (assuming a test suite exists). I’ve done that quite a few times, e.g., telling Claude I no longer need support for Python 3.X and to modernise the typing hints for 3.X+1, or to add typing hints, etc. It’s also good at producing documentation.<br clear="none">><br clear="none">> I’m not sure where this leaves us. If there’s agreement that a C extension would be good to have, the one Claude made for my prseq project benchmark suite is tested and could be used. I haven’t looked at the BioPython code closely enough to know how much work that would be, though.<br clear="none">><br clear="none">> Thanks again for the thoughtful replies. There was an earlier one asking why the BioPython code was slower than Claude’s minimal pure Python, and suggesting that would be due to the overhead of making SeqRecords. I think that must be the case. 
The impact is much more severe with FASTQ than FASTA, so I guess dealing with the quality strings must be quite expensive.<br clear="none">><br clear="none">> Best,<br clear="none">> Terry<br clear="none">><br clear="none">> From: Peter Cock <<a shape="rect" ymailto="mailto:p.j.a.cock@googlemail.com" href="mailto:p.j.a.cock@googlemail.com">p.j.a.cock@googlemail.com</a>><br clear="none">> Date: Monday, 24. November 2025 at 10:41<br clear="none">> To: Jones Kelly, Terence Carleton <<a shape="rect" ymailto="mailto:terence.jones@charite.de" href="mailto:terence.jones@charite.de">terence.jones@charite.de</a>><br clear="none">> Cc: <a shape="rect" ymailto="mailto:biopython@biopython.org" href="mailto:biopython@biopython.org">biopython@biopython.org</a> <<a shape="rect" ymailto="mailto:biopython@biopython.org" href="mailto:biopython@biopython.org">biopython@biopython.org</a>><br clear="none">> Subject: [ext] Re: [Biopython] A possibility for speeding up FASTA/FASTQ reading in BioPython<br clear="none">><br clear="none">> Hello Terry,<br clear="none">><br clear="none">> I just posted a blog about my thoughts on receiving generative AI<br clear="none">> contributions as an Open Source project maintainer:<br clear="none">><br clear="none">> <a shape="rect" href="https://blastedbio.blogspot.com/2025/11/thoughts-on-generative-ai-contributions.html" target="_blank">https://blastedbio.blogspot.com/2025/11/thoughts-on-generative-ai-contributions.html</a><br clear="none">><br clear="none">> I am sceptical, and in this case adding a Rust dependency to Biopython<br clear="none">> seems too much to ask. I think you could get similar performance gains<br clear="none">> with C (which we do use) where at least the maintainers have some<br clear="none">> experience. 
However, even there, gains may not make the additional<br clear="none">> complexity and maintenance burden worthwhile.<br clear="none">><br clear="none">> Thank you for writing and asking, rather than surprising everyone with<br clear="none">> a large pull request.<br clear="none">><br clear="none">> Peter<br clear="none">><br clear="none">> P.S. Cross reference <a shape="rect" href="https://github.com/biopython/biopython/pull/5085" target="_blank">https://github.com/biopython/biopython/pull/5085</a><br clear="none">><br clear="none">> On Tue, Nov 11, 2025 at 10:00 PM Jones Kelly, Terence Carleton<br clear="none">> <<a shape="rect" ymailto="mailto:terence.jones@charite.de" href="mailto:terence.jones@charite.de">terence.jones@charite.de</a>> wrote:<br clear="none">> ><br clear="none">> > Hi all<br clear="none">> ><br clear="none">> > I regularly process reasonably large FASTQ (hundreds of billions of sequencing reads) and FASTA files using BioPython. For some years I've been meaning to implement a FASTQ/FASTA reader in a compiled language and add Python bindings to improve the speed. I could've done this in C but I spent some decades writing C and I wanted to learn something new, so I considered a few languages. Because Rust makes it very easy to create Python bindings, I decided to give it a try. I thought I'd get going by asking the Claude CLI to write me some Rust. That turned out to be a much, much better experience than I had anticipated. With Claude I played with several implementations, keeping track of timing. Claude also wrote some tests. To compare what I was seeing I got Claude to write a pure Python version, a pure C version, Python bindings to the C, and to create a benchmark suite. From what I can tell, the Rust/Python (and the C/Python) FASTA reading is twice as fast as BioPython and FASTQ reading is four times as fast. I didn't write a single line of code. I just did some minimal cleaning up when things were already far along. 
I've been using the code for the last month or two with no problems.<br clear="none">> ><br clear="none">> > The repo is at <a shape="rect" href="https://github.com/VirologyCharite/prseq" target="_blank">https://github.com/VirologyCharite/prseq</a> (prseq = Python/Rust for sequences). You'll find the benchmark results on that page. There are still some small things I would adjust in the API. BTW, Claude also wrote the README (which should definitely be improved).<br clear="none">> ><br clear="none">> > I am wondering if there might be interest in incorporating this into BioPython. I don't know if there are any Rust dependencies in BioPython but I know that there are some C extensions. We could use either, as their speeds are comparable. If there's interest, I'd be happy to help (or to do it all, after some discussion and maybe with some guidance).<br clear="none">> ><br clear="none">> > Thanks very much for all the work on BioPython. It's really been a pleasure to use the code over the last dozen years or so.<br clear="none">> ><br clear="none">> > Terry Jones<br clear="none">> ><br clear="none">> ><br clear="none">> > _______________________________________________<br clear="none">> > Biopython mailing list - <a shape="rect" ymailto="mailto:Biopython@biopython.org" href="mailto:Biopython@biopython.org">Biopython@biopython.org</a><br clear="none">> > <a shape="rect" href="https://mailman.open-bio.org/mailman/listinfo/biopython" target="_blank">https://mailman.open-bio.org/mailman/listinfo/biopython</a><br clear="none"></div></div></div>
</div>
</div></body></html>