[Open-bio-l] More FASTQ examples for cross project testing

Tue Aug 25 11:24:27 UTC 2009

Hi all,

I've been chatting with Peter Rice (EMBOSS) and Chris Fields (BioPerl)
off list about this plan. I'm going to co-ordinate putting together a
set of valid FASTQ files for shared testing (to supplement the
existing set of invalid FASTQ files, which are already used in the
Biopython and BioPerl unit tests, and hopefully in EMBOSS soon).

What I have in mind is:

XXX_original_YYY.fastq - sample input
XXX_as_sanger.fastq - reference output
XXX_as_solexa.fastq - reference output
XXX_as_illumina.fastq - reference output

where XXX is some name (e.g. wrapped1, wrapped2, shortreads,
longreads, sanger_full_range, solexa_full_range ...) and YYY is the
FASTQ variant (sanger, solexa or illumina) for the "input" file.
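To make the naming scheme concrete, here is a minimal Python sketch that
lists the four file names for one test case. The helper name
fastq_test_filenames is made up for illustration and is not part of any
of the three projects.

```python
# Hypothetical helper illustrating the proposed naming convention.
VARIANTS = ("sanger", "solexa", "illumina")


def fastq_test_filenames(name, input_variant):
    """Return the original file name followed by the three reference names."""
    if input_variant not in VARIANTS:
        raise ValueError("Unknown FASTQ variant: %r" % input_variant)
    original = "%s_original_%s.fastq" % (name, input_variant)
    references = ["%s_as_%s.fastq" % (name, v) for v in VARIANTS]
    return [original] + references
```

For example, fastq_test_filenames("wrapped1", "sanger") gives the four
wrapped1 file names used in the example further down.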

For example, we might have:

wrapped1_original_sanger.fastq - A Sanger FASTQ using line wrapping,
perhaps repeating the title on the plus lines
wrapped1_as_sanger.fastq - The same data but following the consensus
style: no line wrapping and no repeated title on the plus lines.
wrapped1_as_solexa.fastq - As above, but converted to Solexa scores
(ASCII offset 64), with capping at Solexa 62 (ASCII 126).
wrapped1_as_illumina.fastq - As above, but converted to Illumina ASCII
offset 64, with capping at PHRED 62 (ASCII 126).
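As a rough sketch of the two conversions described above, the Python
below re-encodes a Sanger quality string (ASCII offset 33) into the
Solexa and Illumina encodings (ASCII offset 64) with capping at 62. The
function names are made up for illustration. The PHRED to Solexa mapping
uses the usual log-odds formula with rounding to the nearest integer,
and mapping PHRED 0 to Solexa -5 is an assumption of this sketch, not
something agreed between the projects.

```python
import math

SANGER_OFFSET = 33  # ASCII offset for Sanger FASTQ
OFFSET_64 = 64      # ASCII offset for Solexa and Illumina FASTQ
MAX_SCORE_64 = 62   # cap so that the encoded character stays at ASCII 126


def phred_to_solexa(phred):
    """Map a PHRED score to the nearest integer Solexa score.

    PHRED 0 has no finite Solexa equivalent; returning -5 (taken here as
    the lowest Solexa score) is an assumption of this sketch.
    """
    if phred <= 0:
        return -5
    solexa = 10 * math.log10(10 ** (phred / 10.0) - 1)
    return max(-5, int(round(solexa)))


def sanger_to_illumina_quality(quality):
    """Re-encode a Sanger quality string as Illumina (PHRED, offset 64)."""
    scores = (ord(char) - SANGER_OFFSET for char in quality)
    return "".join(chr(OFFSET_64 + min(q, MAX_SCORE_64)) for q in scores)


def sanger_to_solexa_quality(quality):
    """Re-encode a Sanger quality string as Solexa scores (offset 64)."""
    scores = (phred_to_solexa(ord(char) - SANGER_OFFSET) for char in quality)
    return "".join(chr(OFFSET_64 + min(q, MAX_SCORE_64)) for q in scores)
```

With these definitions the Sanger quality string "!I~" (PHRED 0, 40 and
93) becomes "@h~" in the Illumina encoding, where the final character
shows the capping at PHRED 62.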

Here "wrapped1" would be a Sanger FASTQ file with some line wrapping
(e.g. at 60 characters). I will include "sanger_full_range" which
would cover all the valid PHRED scores from 0 to 93, and similarly for
Solexa and Illumina files - these are important for testing the score
conversions. I have some ideas for deliberately tricky (but valid)
files which should properly test any parser.
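For the "full range" files, the quality line could be generated along
the following lines. This is only a sketch: the function name is made
up, and while the Sanger range (PHRED 0 to 93) and the upper limit of
62 for the other two variants come from the description above, the
lower limits used here (Solexa -5, Illumina PHRED 0) are my assumption.

```python
# (ASCII offset, lowest score, highest score) for each variant.
# The lower limits for solexa and illumina are assumptions of this sketch.
SCORE_RANGES = {
    "sanger": (33, 0, 93),
    "solexa": (64, -5, 62),
    "illumina": (64, 0, 62),
}


def full_range_quality(variant):
    """Return a quality string covering every valid score exactly once."""
    offset, low, high = SCORE_RANGES[variant]
    return "".join(chr(offset + score) for score in range(low, high + 1))
```

All three strings end at ASCII 126 ("~"); they differ in where they
start and hence in length.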

The point is we have "perhaps odd but valid" originals, plus the
"cleaned up" versions (using the same FASTQ variant), and "cleaned up"
versions in the other two FASTQ variants.
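To illustrate what "cleaning up" a wrapped file involves, here is a
minimal Python sketch of a parser that copes with line wrapping. It is
not the parser used by any of the three projects, and the function
names are made up. The key point is that the quality string is read
until it is as long as the sequence, rather than treating a line
starting with "@" as the next record, since "@" is a valid quality
character.

```python
def parse_fastq(lines):
    """Yield (title, sequence, quality) tuples, allowing line wrapping."""
    lines = iter(lines)
    line = next(lines, None)
    while line is not None:
        line = line.rstrip("\r\n")
        if not line:
            # Tolerate blank lines between records.
            line = next(lines, None)
            continue
        if not line.startswith("@"):
            raise ValueError("Expected '@' title line, got %r" % line)
        title = line[1:]
        # Sequence lines run until the '+' line.
        seq_parts = []
        for line in lines:
            line = line.rstrip("\r\n")
            if line.startswith("+"):
                break
            seq_parts.append(line)
        else:
            raise ValueError("Missing '+' line for record %r" % title)
        if line[1:] and line[1:] != title:
            raise ValueError("Title on '+' line does not match %r" % title)
        seq = "".join(seq_parts)
        # Quality lines run until we have one character per base.
        qual_parts = []
        qual_len = 0
        while qual_len < len(seq):
            line = next(lines, None)
            if line is None:
                raise ValueError("Truncated quality for record %r" % title)
            line = line.rstrip("\r\n")
            qual_parts.append(line)
            qual_len += len(line)
        if qual_len != len(seq):
            raise ValueError("Quality longer than sequence in %r" % title)
        yield title, seq, "".join(qual_parts)
        line = next(lines, None)


def format_fastq(title, seq, qual):
    """Format one record with no wrapping and a bare '+' line."""
    return "@%s\n%s\n+\n%s\n" % (title, seq, qual)
```

Feeding each parsed record through format_fastq gives the unwrapped
layout intended for the XXX_as_YYY.fastq files.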

Ideally, asking Biopython/BioPerl/EMBOSS to convert the
XXX_original_YYY.fastq files into any of the three FASTQ variants
should give output identical to the reference files.
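A test harness for this could look something like the Python sketch
below. It assumes the file naming proposed above, and takes the
converter as a callable argument, convert(text, source_variant,
target_variant), which is a placeholder for whatever each project
provides rather than a real API.

```python
import glob
import os

VARIANTS = ("sanger", "solexa", "illumina")


def check_against_references(directory, convert):
    """Return a list of (original file name, target variant) mismatches.

    convert is a hypothetical callable taking the original file's text,
    the source variant and the target variant, and returning FASTQ text.
    """
    failures = []
    pattern = os.path.join(directory, "*_original_*.fastq")
    for original in sorted(glob.glob(pattern)):
        stem = os.path.basename(original)[:-len(".fastq")]
        # Split on the last "_original_" so names may contain underscores.
        name, source_variant = stem.rsplit("_original_", 1)
        with open(original) as handle:
            original_text = handle.read()
        for target in VARIANTS:
            reference = os.path.join(
                directory, "%s_as_%s.fastq" % (name, target))
            with open(reference) as handle:
                expected = handle.read()
            # The comparison is exact, character for character.
            if convert(original_text, source_variant, target) != expected:
                failures.append((os.path.basename(original), target))
    return failures
```

An empty list means every conversion matched its reference output
exactly.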

If anyone has any comments or suggestions, please speak up (e.g. about
my suggested naming conventions).

Real life examples of FASTQ files anyone has had trouble parsing (even
with 3rd party tools) would be particularly useful - although we'd
probably want to cut down big example files in order to keep the
dataset to a reasonable size.

Thanks,

Peter