[Biopython] large fasta files

Tue Sep 9 13:04:38 UTC 2014

On Tue, Sep 9, 2014 at 1:55 PM, Jurgens de Bruin <debruinjj at gmail.com> wrote:
> Hi,
>
> So the id I am matching to are in a set .

Good :)

> if seq.id in lset_id:
>    list_seq.append(seq)

This looks like you are building a list of SeqRecord objects in memory.
If you are looking for a large number of entries in the FASTA file, this
will consume a lot of RAM (and if you run out of RAM, everything will
suddenly slow down as swap space is used instead).

I would use a generator approach to write out the records you want
immediately; see the "Filtering a sequence file" example in the
Cookbook chapter of the Biopython Tutorial:

http://biopython.org/DIST/docs/tutorial/Tutorial.html
http://biopython.org/DIST/docs/tutorial/Tutorial.pdf

In your case, replace "sff" with "fasta" and adjust how the set of
wanted identifiers is loaded.
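
Something along these lines, as a rough sketch rather than tested
production code. The function name filter_fasta and the file names in
the usage note are made up for illustration; wanted_ids stands in for
your existing set (lset_id):

```python
from Bio import SeqIO


def filter_fasta(input_path, wanted_ids, output_path):
    """Copy records whose id is in wanted_ids; return the number written."""
    # Generator expression: each record is parsed, tested and handed to
    # SeqIO.write one at a time, so no list of SeqRecord objects is built.
    records = (
        rec for rec in SeqIO.parse(input_path, "fasta") if rec.id in wanted_ids
    )
    # SeqIO.write accepts any iterator of records and returns the count.
    return SeqIO.write(records, output_path, "fasta")
```

You would call it as, for example,
filter_fasta("big_file.fasta", lset_id, "wanted.fasta"), with your own
file names in place of those placeholders.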

Peter