[emboss-dev] FASTQ parsing speed in EMBOSS

Tue Jul 28 08:05:47 UTC 2009

Peter wrote:
> Hi all,
> 
> I've been testing EMBOSS 6.1.0 with a patch from Peter Rice for
> some of the FASTQ issues I've raised, and I decided to do a few
> simple benchmarks.
> 
> This is over 40 thousand reads per second, but I was still a
> little disappointed in the run time. Improvements in the FASTQ
> parsing/writing speed would help get EMBOSS used in
> sequencing centre pipelines. Once we have the EMBOSS
> FASTQ input/output working as intended, does trying to
> speed it up further seem worthwhile?

Thanks. I'll take a look. FASTQ parsing is already fairly fast, in that
writing the output takes about as long as reading the input. There may be
ways to speed up the output side: it requires making an output sequence
object, which takes about half of the output time.

Building EMBOSS with --with-gccprofile and compiling with gcc gives you
gprof profiling output, which is very useful for catching bottlenecks.

Until the advent of NGS data, large input/output runs were limited to
converting EMBL/GenBank into FASTA as a one-off every few months, so the
efficiency of sequence reading/writing was a low priority. It now matters
much more.

> Another suggestion (although not demonstrated in the above
> benchmark) is for the Solexa FASTQ parsing (and output).
> From looking at the code, you map the ASCII to a PHRED
> score for each letter of every read. This is a relatively
> expensive operation using powers and logs. I would try
> using a precomputed look up table (something I have just
> been working on for Biopython - this made a very big
> difference, especially when converting to/from Solexa
> scores to PHRED scores).

Yes, that was on my list of future changes. There wasn't time to fully 
implement and test before the release freeze.
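
For anyone following the thread, here is a minimal sketch of the lookup
table idea, written in Python for brevity rather than taken from the
EMBOSS C code or from Biopython. It assumes the usual Solexa FASTQ
convention (ASCII offset 64, scores from -5 to 62) and the standard
conversion Q_phred = 10 * log10(10^(Q_solexa/10) + 1). All names in it
are hypothetical.

```python
import math

# Assumed Solexa FASTQ convention: ASCII offset 64, scores -5..62.
SOLEXA_OFFSET = 64
SOLEXA_MIN, SOLEXA_MAX = -5, 62


def solexa_to_phred(q_solexa):
    """Direct conversion using a power and a log (the slow path)."""
    return 10.0 * math.log10(10.0 ** (q_solexa / 10.0) + 1.0)


# Precompute once: quality character -> PHRED score rounded to an integer.
# Only 68 entries are needed, so the powers and logs are paid once at
# start-up instead of once per base.
PHRED_FROM_SOLEXA_CHAR = {
    chr(q + SOLEXA_OFFSET): int(round(solexa_to_phred(q)))
    for q in range(SOLEXA_MIN, SOLEXA_MAX + 1)
}


def decode_solexa_quality(quality_string):
    """Map a Solexa FASTQ quality string to PHRED scores via the table."""
    return [PHRED_FROM_SOLEXA_CHAR[c] for c in quality_string]
```

With this in place, each base costs one table lookup. A character
outside the assumed range raises KeyError, which also catches files
using a different quality encoding.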

regards,

Peter