[Biopython] fastq manipulations speed

Sun Mar 17 20:22:49 UTC 2013

Hi Natassa,

First, I wouldn't bother indexing.  This seems a one-and-done operation and
indexing is thus a waste of time.  Have the list of stuff you want to find
first, then iterate through the fasta file looking for what you want.  In
general though, what are you hoping to accomplish with the qualities?  That
would help immensely with any feedback and best practice suggestions.  Are
you just doing QC?  If so, fastQC might be a better option than rolling
your own solution.

One comment on the code that will speed it up:
don't use if record in fq_dict.keys().  That returns a list which is going
to have a lookup time proportional to the list size.  Do:
fq_keys = set(fq_dict.keys()) and then if record in fq_keys, this will be
O(1) lookup time.

Chris

On Sun, Mar 17, 2013 at 4:04 PM, natassa <natassa_g_2000 at yahoo.com> wrote:

> Hi biopython list,
>
> I have a few fasta files that come from processing fastq illumina reads
> for quality, polyAs adaptors etc, but i need to get their associated
> qualities back. I wrote a simple script that calls the following 2
> funbctions, which I think are the fastest way to deal with fastq-fasta
> files in Biopython, but the script is awfully slow:
> For example, for one of my files, after 41h of run, only 28000 records out
> of 28 million have been processed. My files contain between 28-40 million
> reads, so i need to somehow make it faster if this is possible. Any ideas
> or any things you might see in the code that make iot so slow?
>
> def makeDictofFasta_withLgth(fastafile):
>     '''tested on Illumina IIx files after my cleaning routine, ie
> sequences used in velvet'''
>     mydict={}
>     info= SeqIO.index(fastafile, "fasta")
>     for rec in info.keys():
>         mydict[rec]=len(info[rec].seq)
>     print 'finished making dictionary of fasta records'
>     return mydict
>
>
> def Addquals_inTrimmedFastA(fastq, newfastq, fasta):
>     outfastq=open(newfastq, "w")
>     fasta_dict=makeDictofFasta_withLgth(fasta)
>     fq_dict = SeqIO.index(fastq,"fastq-illumina")
>
>     for record in fasta_dict.keys():
>         if record in fq_dict.keys():
>             length= fasta_dict[record]
>             sub_rec = fq_dict[record][0:length]
>             outfastq.write(sub_rec.format('fastq-illumina'))
>     outfastq.close()
>
> Thanks in advance,
>
> Natassa
>
> _______________________________________________
> Biopython mailing list  -  Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>