[Biopython] entire sequence file is unintentionally being loaded
Liam Thompson
dejmail at gmail.com
Wed Nov 9 18:42:53 UTC 2016
Hi everyone
I have written a demultiplexing script for an Illumina NGS library,
where I analyse each read pair, look up the barcode-primer in a
dictionary, and assign the reads to a sample file. I'm using Python 2.7
for compatibility reasons on a Linux machine, with the most recent
Biopython release.
Obviously, I don't want to load the entire sequence file into memory,
which is what I have tried to avoid by first indexing the reads with
Biopython, something Peter helped me with in a previous email.
So I take the dictionary-like objects I receive from the index
function and combine their values with zip, so that I have the
paired-read information in one tuple.
for r1, r2 in zip(self.R1.values(), self.R2.values()):
    pair_seq_dict = {'r1': r1, 'r2': r2}
I thought fetching the R1 and R2 values like this would essentially
continuously query the index until the index has run out of values to
return. I've obviously missed something or am implementing it wrong.
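
A note on the expectation of lazy fetching: under Python 2.7 the built-in
zip() builds a complete list before the loop starts, and dict.values()
conventionally returns a list as well, so neither call is lazy by itself.
How the index object's own .values() behaves is not confirmed here. The
sketch below only illustrates the lazy alternative, itertools.izip, on a
made-up generator (counting_source is a hypothetical stand-in for an
index, not a Biopython function).

```python
from __future__ import print_function

try:
    from itertools import izip  # Python 2: lazy counterpart of zip()
except ImportError:
    izip = zip  # Python 3: the built-in zip() is already lazy


def counting_source(n, log):
    """Yield n dummy items, recording each one in `log` as it is produced."""
    for i in range(n):
        log.append(i)
        yield "read%d" % i


log1, log2 = [], []
pairs = izip(counting_source(1000, log1), counting_source(1000, log2))

# Pull a single pair; only one item per source should have been produced.
first = next(pairs)
print(first, len(log1), len(log2))
```

With Python 2's plain zip() in place of izip, both sources would be
exhausted before the first pair became available.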
I have checked the output log, where I log the values used in the code,
and the entire file does not appear to be read into memory. Or at least
that is what displaying the variables' contents suggests: they only ever
seem to hold the R1 and R2 Seq objects for the current pair (so two
sequences' worth of info).
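
One way to see what the process as a whole is holding, rather than what
the loop variables show, is to log its peak resident memory at intervals.
A minimal sketch using the standard-library resource module, assuming a
Linux machine (where ru_maxrss is reported in kilobytes); peak_rss_kb is
a hypothetical helper name, and the list allocation merely stands in for
real work.

```python
from __future__ import print_function

import resource


def peak_rss_kb():
    """Return this process's peak resident set size (kilobytes on Linux)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


before = peak_rss_kb()
# Stand-in for the real work: hold a large list in memory for a moment.
big = ["x" * 100 for _ in range(200000)]
after = peak_rss_kb()
print("peak RSS before: %d kB, after: %d kB" % (before, after))
```

Logging peak_rss_kb() just before the loop, after the first pair, and
every few thousand pairs would show whether memory jumps before the first
pair is processed or grows gradually.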
So my question is: how do I find out what is going on? What have I
misunderstood? What is the best way for me to iterate over the index,
given that I have two indices (R1 and R2) and need to analyse the reads
as a pair? I suspect it is the .values() call where I am going wrong.
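
For reference, one possible shape for pairwise iteration is to walk the
keys of one index and look up the mate in the other, so that only one
pair is fetched at a time. This is a sketch under assumptions, not a
confirmed fix: it assumes both indices use identical keys for mates (any
/1 and /2 suffixes would need normalising first), and iter_read_pairs and
the file names in the comments are hypothetical. Plain dicts stand in for
the index objects so the example runs without any sequence files.

```python
from __future__ import print_function


def iter_read_pairs(r1_index, r2_index):
    """Yield (r1, r2) one pair at a time by looking up shared keys."""
    for key in r1_index:        # iterating a mapping yields its keys
        if key in r2_index:     # skip reads that have no mate
            yield r1_index[key], r2_index[key]


# With Biopython the two mappings might come from (hypothetical file names):
#   from Bio import SeqIO
#   r1_index = SeqIO.index("sample_R1.fastq", "fastq")
#   r2_index = SeqIO.index("sample_R2.fastq", "fastq")
r1_index = {"read1": "ACGT", "read2": "GGCC"}
r2_index = {"read1": "TTAA", "read2": "CCGG", "read3": "AAAA"}

pairs = list(iter_read_pairs(r1_index, r2_index))
print(sorted(pairs))
```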
I really appreciate any comments or help
Kind regards
Liam Thompson
Mölndal