<div dir="auto">

<p><strong>Subject:</strong> Reproducibility and evaluation instability in clinical ML pipelines using Biopython-based workflows</p>

<hr>

<p>Hi Biopython team,</p>

<p>I use Biopython in biomedical data-processing pipelines for clinical machine learning research.</p>

<p>While developing a reproducible ICU prediction pipeline on MIMIC-IV-derived datasets (<a href="https://github.com/netanelcyber/PenuX">https://github.com/netanelcyber/PenuX</a>), we made an observation that may be relevant to the broader bioinformatics community.</p>

<p>Although Biopython is primarily used for sequence and molecular data workflows, we found it useful as part of a broader preprocessing and integration pipeline alongside clinical datasets and downstream ML models.</p>

<hr>

<h3>Observation (methodological, not tool-specific)</h3>

<p>Across multiple modeling experiments, we observed that:</p>

<ul><li>standard performance metrics (e.g., AUROC) remain relatively stable across implementations</li><li>model reliability, however, varies significantly across:

<ul><li>temporal validation versus random splits</li><li>different preprocessing strategies</li><li>subgroup stratification and missing-data regimes</li></ul>

</li></ul>

<p>These effects appear to be <strong>evaluation-design dependent rather than model-dependent</strong>, and raise broader questions about reproducibility in biomedical ML pipelines.</p>
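<p>To make the split comparison concrete, here is a minimal, standard-library-only sketch of a random split versus a temporal split over the same records, plus a pairwise AUROC helper. It is an illustration under stated assumptions, not the code from our pipeline: the field names <code>admit_date</code>, <code>label</code>, and <code>score</code> are hypothetical, and the records are synthetic toy data rather than MIMIC-IV rows.</p>

```python
import random
from datetime import date


def random_split(records, test_fraction=0.25, seed=0):
    """Shuffle with a fixed seed, then cut. Time ordering is ignored."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def temporal_split(records, test_fraction=0.25, time_key="admit_date"):
    """Train on the earliest records and test on the latest ones."""
    ordered = sorted(records, key=lambda r: r[time_key])
    n_test = max(1, round(len(ordered) * test_fraction))
    return ordered[:-n_test], ordered[-n_test:]


def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Quadratic in the number of records,
    which is fine for a sketch but not for large cohorts.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Synthetic toy records; field names are hypothetical, not MIMIC-IV columns.
    toy = [
        {
            "admit_date": date(2018 + i // 4, 1 + (i % 4) * 3, 1),
            "label": i % 2,
            "score": (i % 5) / 5,
        }
        for i in range(12)
    ]
    splits = {"random": random_split(toy), "temporal": temporal_split(toy)}
    for name, (train, test) in splits.items():
        dates = sorted(r["admit_date"].isoformat() for r in test)
        print(name, "test dates:", dates)
        labels = [r["label"] for r in test]
        if len(set(labels)) == 2:  # AUROC is undefined with a single class
            print(name, "test AUROC:", auroc(labels, [r["score"] for r in test]))
```

<p>With the temporal split, every test record is at least as recent as every training record, which is the property a random split does not guarantee. The fixed seed only makes the random split repeatable; it does not make it comparable to a temporal one.</p>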

<hr>

<h3>Why this may be relevant to Biopython users</h3>

<p>Even though Biopython is not directly responsible for clinical ML evaluation, many real-world pipelines combine:</p>

<ul><li>biological data processing (where Biopython is used)</li><li>clinical datasets (e.g., MIMIC-IV)</li><li>downstream predictive modeling</li></ul>

<p>This creates a gap where upstream reproducibility (sequence/biological processing) is strong, but downstream evaluation protocols may still introduce instability.</p>

<hr>

<h3>Question to the community</h3>

<p>I would be interested to hear whether others in the Biopython community have encountered:</p>

<ul><li>reproducibility issues when Biopython pipelines are integrated into larger ML systems</li><li>challenges in maintaining consistency across downstream evaluation setups</li><li>best practices for ensuring pipeline-level reproducibility beyond sequence processing itself</li></ul>

<hr>

<h3>Context</h3>

<p>Project reference (for reproducibility context only): <a href="https://github.com/netanelcyber/PenuX">https://github.com/netanelcyber/PenuX</a></p>

<hr>

<p>Thank you for your work on Biopython. It remains a foundational tool in computational biology and bioinformatics workflows.</p>

<p>Best regards,</p></div>