[Biopython] slice a record in two and writing both records

Dilara Ally dilara.ally at gmail.com
Thu Jul 19 15:51:35 UTC 2012


If I have a function (modify_record) that slices up a SeqRecord into sub
records and then returns the sliced record if it has a certain length (for
e.g. the sliced record needs to be greater than 40bp), sometimes the
original record when sliced will have two different records both greater
than 40bp.  I want to keep both sliced reads and rewrite them as separate
records into a single fastq file.  Here is my code:

def modify_record(frec, win, len_threshold):
    quality_scores = array(frec.letter_annotations["phred_quality"])
    all_window_qc = slidingWindow(quality_scores, win,1)
    track_qc = windowQ(all_window_qc)
    myzeros = boolean_array(track_qc, q_threshold,win)
    Nrec = slice_points(myzeros,win)[0][1]-1
    where_to_slice = slice_points(myzeros,win)[1]
    where_to_slice.append(len(frec)+win)
    sub_record = slice_record(frec,Nrec,where_to_slice,win,len_threshold)
    return sub_record

q_threshold = 20
win = 5
len_threshold = 30

from Bio import SeqIO
from numpy import *
good_reads = (rec for rec in SeqIO.parse("hiseq_pe_test.fastq", "fastq") if
array(rec.letter_annotations["phred_quality"]).mean() >= q_threshold)
count = SeqIO.write(good_reads, "temp.fastq", "fastq")
print "Saved %i reads" % count

newly_filtered=[]
for rec in SeqIO.parse("temp.fastq", "fastq"):
    s = modify_record(rec, win, len_threshold)
    newly_filtered.append(s)
    SeqIO.write(newly_filtered, "filtered_temp.fastq", "fastq")

This writes only the first sub_record even when there are more than 1 that
have a len >40bp. I've tried this as a generator expression and I'm still
getting just the first sub_record.   I'd also prefer to not to use append
as it was previously suggested that this can lead to problems if you run
the script more than once.  Instead, I want to employ a generator
expression - but I'm still getting used to the idea of generator
expressions.

My second question is more general.  Generator expressions are more memory
efficient than a list comprehension, but how are they better than just a
simple loop that pulls in a single record, does something and then writes
that record? Is it just a time issue?

Many thanks for the help!



More information about the Biopython mailing list