[Bioperl-l] FASTQ support in Biopython, BioPerl, and EMBOSS

Sat Jul 25 21:12:26 UTC 2009

On Sat, Jul 25, 2009 at 8:50 PM, Chris Fields<cjfields at illinois.edu> wrote:
>
>> Now, here comes the problem. I believe FASTQ files directly
>> from an Illumina 1.3+ pipeline will have PHRED scores in the
>> range 0 to 40 (as in this example). However, much higher
>> PHRED scores are possible during assembly / contig'ing
>> and read mapping. For example, the tool MAQ will output
>> Sanger style FASTQ files with PHRED scores in the range
>> 0 to 93 inclusive.
>
> Is this behavior documented anywhere, specifically by Illumina (that values
> can exceed 40)?  If Illumina 1.3 is specified as being PHRED 0-40, and
> another (non-Illumina) software package pushes that limit above the
> specified range of Illumina values, I would consider that unfortunately yet
> another variant.
>
> We can support it as Illumina 1.3, but my point is this may getting into a
> grey area and may be something that Illumina doesn't/wouldn't support.
>  Reminds me a little of the multiple GFF2 variations (one of the main
> reasons for a GFF3).

I agree this is an grey area (high scores in Solexa/Illumina
FASTQ files).

>> Now, in the Sanger FASTQ format, PHRED scores of 0 to
>> 93 map onto ASCII values of 33 to 126 (! to ~). There is a
>> reason for stopping at 126, since ASCII 127 is "delete".
>>
>> However, in the Illumina 1.3+ FASTQ format, PHRED
>> scores of 0 to 93 would map to ASCII values of 64 to
>> 157, which includes a lot of non printing characters.
>> Working with such files at the command line or in an
>> editor is a big problem. Clearly, Illumina never intended
>> to include such high scores in their FASTQ files!
>
> Exactly.
>
>> Nevertheless, it is possible to write a FASTQ format
>> following the Illumina 1.3+ encoding with these values.
>> Biopython and EMBOSS attempt to do this - although I
>> would regard throwing an error as equally acceptable.
>>
>> So, here is another hand constructed example of a
>> Sanger style FASTQ file using the full quality range:
>>
>> ...
>>
>> Biopython and EMBOSS 6.1.0 differ regarding the plus line, but agree
>> on the quality string which runs from 0x9d to 0x40 (in hex), or 157 to
>> 64 in decimal, which after subtracting the Illumina offset of 64, gives
>> PHRED scores of 93 to 0 as desired.
>>
>> Now to BioPerl,
>>
>> $ perl bioperl_sanger2illumina.pl < sanger_93.fastq
>> ...
>>
>> $ perl bioperl_sanger2illumina.pl < sanger_93.fastq | hexdump -C -v
>> ...
>>
>> BioPerl has output an invalid FASTQ file - it seems to omit the
>> quality scores for the top scoring nucleotides at the start. The
>> BioPerl quality string runs from just "h" to "@", or 0x68 to 0x40
>> (in hex), giving 104 to 64 in decimal, giving PHRED values of
>> 40 to 0. I think BioPerl should either throw an error, or output
>> the non printing characters as done by Biopython and EMBOSS.
>
> If this is accepted as common practice between BioPython and EMBOSS
> we will follow similarly.  I do think it's worth at least a warning for the
> reasons outlined above (e.g. it likely isn't Illumina's intent to support qual
> values outside the specified range).  Might be worth checking into.

True. I think what EMBOSS and Biopython are doing is reasonable
(although a warning in this situation makes sense). Equally, an
error is a valid option. However, one question is when would you
issue the warning/error? For a PHRED score above 40? (Assuming
we have a definative reference for Illumina using just 0 to 40).
How about if a problem character would result? Since ASCII
64+63=127, the first problem character would be for PHRED
score 63.

i.e. An Illumina FASTQ format file can hold PHRED scores in the
range 0 to 62 without using problem characters. And likewise
for a Solexa FASTQ file (Solexa scores up to 62).

> From this it could be summarized that converting to sanger format is least
> problematic, as possible issues may be encountered when converting to the
> other variants.

Yes. The Sanger FASTQ format will hold PHRED scores from 0 to 93
while using nice ASCII characters - this means it is suitable for both
raw reads and processed data from assemblies or read mappings.

In my personal experience, Solexa/Illumina FASTQ files tend to get
converted into the Sanger FASTQ format for downstream analysis
(e.g. the MAQ tool, or the NCBI short read archive).

i.e. Writing high quality reads (i.e. above PHRED 40) to Solexa or
Illumina FASTQ files is unlikely.

> We'll need to fix the solexa quality calculations in the BioPerl
> parser as noted in your previous post; I'll work on that.

Great.

Peter