[Biopython] large fasta files
Peter Cock
p.j.a.cock at googlemail.com
Tue Sep 9 13:04:38 UTC 2014
On Tue, Sep 9, 2014 at 1:55 PM, Jurgens de Bruin <debruinjj at gmail.com> wrote:
> Hi,
>
> So the id I am matching to are in a set .
Good :)
>     if seq.id in lset_id:
>         list_seq.append(seq)
This looks like you are building a list of SeqRecord objects in memory.
If you are looking for a large number of entries in the FASTA file, this
will consume a lot of RAM (and if you run out of RAM, the script will
suddenly slow down as swap space is used instead).
I would use a generator approach to write out the records you want
immediately, see the "Filtering a sequence file" example in the
Cookbook chapter of the Biopython Tutorial:
http://biopython.org/DIST/docs/tutorial/Tutorial.html
http://biopython.org/DIST/docs/tutorial/Tutorial.pdf
In your case, replace "sff" with "fasta" and adjust how the set of
wanted identifiers is loaded.
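As a rough sketch of that generator approach (not the Tutorial code
verbatim): the function names and file names below are made up for
illustration, and it assumes the wanted identifiers sit one per line
in a plain text file. Only SeqIO.parse and SeqIO.write are Biopython
calls.

```python
def load_wanted_ids(id_path):
    """Read identifiers into a set, one per line.

    Assumption: the ID is the first whitespace-separated token on each
    line; blank lines are skipped.
    """
    with open(id_path) as handle:
        return {line.split(None, 1)[0] for line in handle if line.strip()}


def wanted_records(records, wanted):
    """Lazily yield only the records whose .id is in the wanted set.

    This is a generator expression, so no list of records is ever
    built in memory.
    """
    return (rec for rec in records if rec.id in wanted)


def filter_fasta(input_path, id_path, output_path):
    """Stream matching records from one FASTA file into another.

    Returns the number of records written.
    """
    # Imported here so the two helpers above work without Biopython.
    from Bio import SeqIO

    wanted = load_wanted_ids(id_path)
    records = wanted_records(SeqIO.parse(input_path, "fasta"), wanted)
    # SeqIO.write pulls records from the generator one at a time.
    return SeqIO.write(records, output_path, "fasta")
```

You would call it with something like
filter_fasta("big_file.fasta", "wanted_ids.txt", "subset.fasta"),
where all three file names are placeholders. Since the records are
written as they are read, memory use should stay roughly constant
however many entries match.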
Peter