[emboss-dev] FASTQ parsing speed in EMBOSS

Peter biopython at maubp.freeserve.co.uk
Tue Jul 28 09:21:33 UTC 2009


On Tue, Jul 28, 2009 at 9:05 AM, Peter Rice <pmr at ebi.ac.uk> wrote:
>
> Thanks. I'll take a look. FASTQ parsing is pretty fast - in that writing the
> output takes about as long as reading the input. There may be ways to speed
> that up (output requires making an output sequence object which takes half
> the output time).
>
> Building EMBOSS with --with-gccprofile and compiling with gcc creates a
> gprof profile. Very useful for catching bottlenecks.

Nice tip.

> Up to the advent of NGS data, large input/output runs have been limited to
> converting EMBL/GenBank into Fasta as a one-off every few months so looking
> into the efficiency of sequence reading/writing has been a low priority. Now
> it does assume much more importance.

Exactly :)

>> Another suggestion (although not demonstrated in the above
>> benchmark) is for the Solexa FASTQ parsing (and output).
>> From looking at the code, you map the ASCII to a PHRED
>> score for each letter of every read. This is a relatively
>> expensive operation using powers and logs. I would try
>> using a precomputed look up table (something I have just
>> been working on for Biopython - this made a very big
>> difference, especially when converting to/from Solexa
>> scores to PHRED scores).
>
> Yes, that was on my list of future changes. There wasn't time to fully
> implement and test before the release freeze.

That makes sense - and it is a pretty obvious thing to try, so
I would have been surprised if you hadn't come up with the
same idea.
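
In case it is useful, here is a minimal Python sketch of the lookup
table idea. It is not the Biopython or EMBOSS code: the names
(phred_from_solexa, SOLEXA_CHAR_TO_PHRED, phred_scores) are made up
for illustration, and it assumes Solexa FASTQ quality strings use an
ASCII offset of 64 with scores in the range -5 to 62.

```python
import math

# Assumptions for this sketch: Solexa FASTQ encodes each score as
# chr(score + 64), with scores running from -5 to 62.
SOLEXA_OFFSET = 64
MIN_SOLEXA, MAX_SOLEXA = -5, 62


def phred_from_solexa(solexa):
    """Convert one Solexa score to PHRED (one power and one log)."""
    return 10.0 * math.log10(10.0 ** (solexa / 10.0) + 1.0)


# Build the table once, keyed directly on the ASCII character, so the
# per-letter work while parsing is a single dictionary lookup.
SOLEXA_CHAR_TO_PHRED = {
    chr(score + SOLEXA_OFFSET): phred_from_solexa(score)
    for score in range(MIN_SOLEXA, MAX_SOLEXA + 1)
}


def phred_scores(quality_string):
    """Map a Solexa FASTQ quality string to a list of PHRED scores."""
    table = SOLEXA_CHAR_TO_PHRED
    return [table[char] for char in quality_string]
```

With only 68 possible characters the table is tiny, and the powers
and logs are paid once at start-up rather than once per base. A
character outside the assumed range raises KeyError, which doubles
as a cheap validity check on the input.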

Peter


