[Open-bio-l] More FASTQ examples for cross project testing

Michael Heuer heuermh at acm.org
Wed Aug 26 02:56:20 UTC 2009


Peter wrote:

> Hi all,
>
> I've been chatting with Peter Rice (EMBOSS) and Chris Fields (BioPerl)
> off list about this plan. I'm going to co-ordinate putting together a
> set of valid FASTQ files for shared testing (to supplement the
> existing set of invalid FASTQ files already done and being used in
> Biopython and BioPerl's unit tests - and hopefully with EMBOSS soon).
>
> What I have in mind is:
>
> XXX_original_YYY.fastq - sample input
> XXX_as_sanger.fastq - reference output
> XXX_as_solexa.fastq - reference output
> XXX_as_illumina.fastq - reference output
>
> where XXX is some name (e.g. wrapped1, wrapped2, shortreads,
> longreads, sanger_full_range, solexa_full_range ...) and YYY is the
> FASTQ variant (sanger, solexa or illumina) for the "input" file.
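To make the proposed naming convention concrete, here is a minimal sketch of how a test harness might split such names into their parts. The helper names (parse_test_filename, reference_names) are hypothetical and not taken from Biopython, BioPerl, EMBOSS or BioJava; the sketch also assumes the XXX part may itself contain underscores, as in sanger_full_range.

```python
import re

# The three FASTQ variants named in the proposal.
VARIANTS = ("sanger", "solexa", "illumina")

# XXX_original_YYY.fastq (sample input) or XXX_as_YYY.fastq (reference output).
# Assumption: XXX may contain underscores, so the name group is greedy.
_PATTERN = re.compile(
    r"^(?P<name>.+)_(?P<role>original|as)_(?P<variant>sanger|solexa|illumina)\.fastq$"
)


def parse_test_filename(filename):
    """Return (name, role, variant); role is 'input' or 'reference'."""
    match = _PATTERN.match(filename)
    if match is None:
        raise ValueError("not a recognised test file name: %r" % filename)
    role = "input" if match.group("role") == "original" else "reference"
    return match.group("name"), role, match.group("variant")


def reference_names(original):
    """List the three reference output names expected for an input file."""
    name, role, _variant = parse_test_filename(original)
    if role != "input":
        raise ValueError("expected an XXX_original_YYY.fastq name: %r" % original)
    return ["%s_as_%s.fastq" % (name, variant) for variant in VARIANTS]
```

For example, reference_names("wrapped1_original_sanger.fastq") would list wrapped1_as_sanger.fastq, wrapped1_as_solexa.fastq and wrapped1_as_illumina.fastq.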
>
> For example, we might have:
>
> wrapped1_original_sanger.fastq - A Sanger FASTQ using line wrapping,
> perhaps repeating the title on the plus lines
> wrapped1_as_sanger.fastq - The same data but using the consensus of no
> line wrapping and omitting the repeated title on the plus lines.
> wrapped1_as_solexa.fastq - As above, but converted to Solexa scores
> (ASCII offset 64), with capping at Solexa 62 (ASCII 126).
> wrapped1_as_illumina.fastq - As above, but converted to Illumina ASCII
> offset 64, with capping at PHRED 62 (ASCII 126).
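The conversions and capping described above could be sketched per quality character as follows. This is only an illustration: the function names are hypothetical, and rounding to the nearest integer with a floor of Solexa -5 is an assumption about how the reference files would be produced, not something stated in this thread. The PHRED to Solexa mapping uses the usual relation Q_solexa = 10 * log10(10^(Q_phred / 10) - 1).

```python
from math import log10

SANGER_OFFSET = 33   # Sanger FASTQ: PHRED scores 0 to 93
OFFSET_64 = 64       # Solexa and Illumina FASTQ both use ASCII offset 64
MAX_SCORE_64 = 62    # cap so the character never exceeds ASCII 126


def phred_to_solexa(phred):
    """Map a PHRED score to a Solexa score (assumed: round, floor at -5)."""
    if phred <= 0:
        return -5
    return max(-5, int(round(10 * log10(10 ** (phred / 10.0) - 1))))


def _sanger_phred(char):
    """Decode one Sanger quality character into a PHRED score."""
    phred = ord(char) - SANGER_OFFSET
    if not 0 <= phred <= 93:
        raise ValueError("not a valid Sanger quality character: %r" % char)
    return phred


def sanger_char_to_solexa_char(char):
    """Sanger character -> Solexa character, capped at Solexa 62."""
    solexa = min(phred_to_solexa(_sanger_phred(char)), MAX_SCORE_64)
    return chr(solexa + OFFSET_64)


def sanger_char_to_illumina_char(char):
    """Sanger character -> Illumina character, capped at PHRED 62."""
    return chr(min(_sanger_phred(char), MAX_SCORE_64) + OFFSET_64)
```

With these definitions the Sanger character "I" (PHRED 40) becomes "h" in both other variants, and the top Sanger character "~" (PHRED 93) is capped to "~" (score 62 at offset 64).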
>
> Here "wrapped1" would be a Sanger FASTQ file with some line wrapping
> (e.g. at 60 characters). I will include "sanger_full_range" which
> would cover all the valid PHRED scores from 0 to 93, and similarly for
> Solexa and Illumina files - these are important for testing the score
> conversions. I have some ideas for deliberately tricky (but valid)
> files which should properly test any parser.
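As a rough illustration of what a "full range" file would exercise, the sketch below builds one record per variant whose quality string covers every valid score once. The record title, the repeated ACGT sequence and the helper names are made up for this example, and the lower bound of -5 for Solexa is the conventional one rather than something stated above; the actual test files may well be laid out differently.

```python
# variant -> (ASCII offset, lowest score, highest score)
RANGES = {
    "sanger": (33, 0, 93),
    "solexa": (64, -5, 62),
    "illumina": (64, 0, 62),
}


def full_range_quality_string(variant):
    """Quality string with one character per valid score, in ascending order."""
    offset, lowest, highest = RANGES[variant]
    return "".join(chr(score + offset) for score in range(lowest, highest + 1))


def full_range_record(variant, title="full_range"):
    """One unwrapped four-line FASTQ record covering the variant's score range."""
    quality = full_range_quality_string(variant)
    # Placeholder sequence: ACGT repeated, cut to the quality string's length.
    sequence = ("ACGT" * len(quality))[:len(quality)]
    return "@%s\n%s\n+\n%s\n" % (title, sequence, quality)
```

Note that the Illumina quality string starts with "@" (PHRED 0 at offset 64), which is exactly the sort of valid but awkward input a parser has to cope with.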
>
> The point is we have "perhaps odd but valid" originals, plus the
> "cleaned up" versions (using the same FASTQ variant), and "cleaned up"
> versions in the other two FASTQ variants.
>
> Ideally asking Biopython/BioPerl/EMBOSS to convert the
> XXX_original_YYY.fastq files into any of the three FASTQ variants will
> give exactly the same as the reference outputs.
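A shared test could then check each toolkit's converted output against the reference file line by line. The following is a minimal sketch with a hypothetical helper name; it assumes both files have already been read into strings, and because it uses splitlines() it deliberately ignores differences in line endings, which a stricter byte-for-byte test would not.

```python
from itertools import zip_longest


def first_difference(produced, reference):
    """Return (line_number, produced_line, reference_line) for the first
    mismatching line, or None if the two texts agree line by line.

    A missing line on either side is reported as None in that position.
    """
    produced_lines = produced.splitlines()
    reference_lines = reference.splitlines()
    pairs = zip_longest(produced_lines, reference_lines)
    for number, (got, expected) in enumerate(pairs, start=1):
        if got != expected:
            return number, got, expected
    return None
```

Reporting the first differing line, rather than just pass or fail, should make it easier to see whether a mismatch comes from the title line, the plus line or the quality scores.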
>
> If anyone has any comments or suggestions please speak up (e.g. my
> suggested naming conventions).

Very cool idea, Peter, and Peter, and Chris.  I don't believe anyone from
BioJava has spoken up on this thread yet, so I thought I should add that
we are working towards a compatible implementation as well.

   michael



