[Biopython] large fasta files
Ivan Gregoretti
ivangreg at gmail.com
Tue Sep 9 15:20:46 UTC 2014
Hello Jurgens and Peter,
I use this strategy and it is extremely fast:
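from Bio import SeqIO

# Collect every record ID in the FASTA file into a set for fast lookups
ids_set = set()
with open('file_name.fa', 'r') as file_handle:
    for record in SeqIO.parse(file_handle, 'fasta'):
        ids_set.add(record.id)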
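
Once the identifiers are in a set, the filtering step Peter describes in the quoted message below can be streamed to disk rather than collected in a list. Here is a minimal sketch, assuming the wanted IDs are already loaded. The function name filter_fasta, its parameters, and the file names in the usage line are illustrative and not from the thread:

```python
from Bio import SeqIO


def filter_fasta(in_path, out_path, wanted_ids):
    """Write records whose id is in wanted_ids; return how many were written.

    Hypothetical helper for illustration only.
    """
    wanted = set(wanted_ids)  # set membership tests are O(1) on average
    # Generator expression: records are parsed lazily, one at a time
    records = (rec for rec in SeqIO.parse(in_path, "fasta") if rec.id in wanted)
    # SeqIO.write consumes the generator and returns the number of records written
    return SeqIO.write(records, out_path, "fasta")


# Example call (file names are placeholders):
# count = filter_fasta("file_name.fa", "filtered.fa", ids_set)
```

Because SeqIO.write pulls records from the generator as it goes, the full list of matching SeqRecord objects is never built in memory.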
I hope this helps.
Ivan
Ivan Gregoretti, PhD
On Tue, Sep 9, 2014 at 9:11 AM, Jurgens de Bruin <debruinjj at gmail.com> wrote:
> Thanks for all the help much appreciated!
>
>
> On 9 September 2014 15:04, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>>
>> On Tue, Sep 9, 2014 at 1:55 PM, Jurgens de Bruin <debruinjj at gmail.com>
>> wrote:
>> > Hi,
>> >
>> > So the ids I am matching against are in a set.
>>
>> Good :)
>>
>> > if seq.id in lset_id:
>> >     list_seq.append(seq)
>>
>> This looks like you are building a list of SeqRecord objects in memory.
>> If you are looking for a large number of entries in the FASTA file, this
>> will consume a lot of RAM (and if you run out of RAM, it will suddenly
>> slow down as swap space is used instead).
>>
>> I would use a generator approach to write out the records you want
>> immediately, see the "Filtering a sequence file" example in the
>> Cookbook chapter of the Biopython Tutorial:
>>
>> http://biopython.org/DIST/docs/tutorial/Tutorial.html
>> http://biopython.org/DIST/docs/tutorial/Tutorial.pdf
>>
>> In your case, replace "sff" with "fasta" and adjust how the set of
>> wanted identifiers is loaded.
>>
>> Peter
>
>
>
>
> --
> Regards/Groete/Mit freundlichen Grüßen/recuerdos/meilleures salutations/
> distinti saluti/siong/duì yú/привет
>
> Jurgens de Bruin
>
> _______________________________________________
> Biopython mailing list - Biopython at mailman.open-bio.org
> http://mailman.open-bio.org/mailman/listinfo/biopython