[Biopython] fastq manipulations speed
natassa
natassa_g_2000 at yahoo.com
Sun Mar 17 20:04:08 UTC 2013
Hi biopython list,
I have a few fasta files that come from processing fastq illumina reads for quality, polyAs adaptors etc, but i need to get their associated qualities back. I wrote a simple script that calls the following 2 funbctions, which I think are the fastest way to deal with fastq-fasta files in Biopython, but the script is awfully slow:
For example, for one of my files, after 41h of run, only 28000 records out of 28 million have been processed. My files contain between 28-40 million reads, so i need to somehow make it faster if this is possible. Any ideas or any things you might see in the code that make iot so slow?
def makeDictofFasta_withLgth(fastafile):
'''tested on Illumina IIx files after my cleaning routine, ie sequences used in velvet'''
mydict={}
info= SeqIO.index(fastafile, "fasta")
for rec in info.keys():
mydict[rec]=len(info[rec].seq)
print 'finished making dictionary of fasta records'
return mydict
def Addquals_inTrimmedFastA(fastq, newfastq, fasta):
outfastq=open(newfastq, "w")
fasta_dict=makeDictofFasta_withLgth(fasta)
fq_dict = SeqIO.index(fastq,"fastq-illumina")
for record in fasta_dict.keys():
if record in fq_dict.keys():
length= fasta_dict[record]
sub_rec = fq_dict[record][0:length]
outfastq.write(sub_rec.format('fastq-illumina'))
outfastq.close()
Thanks in advance,
Natassa
More information about the Biopython
mailing list