[Biopython-dev] Converting between PHRED and Solexa quality scores (and FASTQ files)

Tue Feb 24 17:22:01 UTC 2009

Hopefully this information will be of general interest - I could have
just stuck it on the end of Bug 2767 but thought it more suited to the
mailing list (or even a blog post?).
http://bugzilla.open-bio.org/show_bug.cgi?id=2767

Nice links on mapping between Solexa and PHRED scores,
http://maq.sourceforge.net/qual.shtml
http://maq.sourceforge.net/fastq.shtml (missing some brackets in the
final formula at the time of writing, I've emailed them)

and:
http://illumina.ucr.edu/ht/documentation/file-formats
http://rcdev.umassmed.edu/pipeline/Alignment%20Scoring%20Guide%20and%20FAQ.html
(note they are missing a minus sign in the definition of Q_solexa)

For good quality reads the two scores are almost equal - but they
differ for poor quality reads (PHRED scores go to zero, but Solexa
scores can be negative).

A standard FASTQ file (as used by Sanger) encodes the quality
information using PHRED scores, while Solexa/Illumina decided to use
their own schema in the FASTQ variant.

In a PHRED style FASTQ file, PHRED quality = ord(letter) - 33
In a Solexa style FASTQ file, Solexa quality = ord(letter) - 64

>>> def phred_quality_from_fastq_letter(letter) :
...     return ord(letter) - 33
...
>>> def solexa_quality_from_fastq_letter(letter) :
...     return ord(letter) - 64
...

Both these scores are defined in terms of the estimated probability of
an error (between 0 for a good read and 1 for a bad read).  A
probability of almost zero gives a high quality score, while a
probability of almost one gives a very low quality score.

>>> def phred_quality_from_error(error) :
...     return -10*log(error,10)
...
>>> def solexa_quality_from_error(error) :
...     return -10*log(error/(1-error),10)
...
>>> solexa_quality_from_error(0.000000001)
89.999999995657035
>>> solexa_quality_from_error(0.999999999)
-90.000000118483911
>>> phred_quality_from_error(0.000000001)
89.999999999999986
>>> phred_quality_from_error(0.999999999)
4.3429446983771231e-09
>>> phred_quality_from_error(1)
-0.0

Using these relationships you can map between PHRED and Solexa quality
scores, assuming their error estimation methods are equivalent,

>>> def solexa_quality_from_phred(phred_quality) :
...     return 10*log(10**(phred_quality/10.0) - 1, 10)
...
>>> solexa_quality_from_phred(90)
89.999999995657035
>>> solexa_quality_from_phred(50)
49.99995657033466
>>> solexa_quality_from_phred(10)
9.5424250943932485
>>> solexa_quality_from_phred(1)
-5.8682532438011537
>>> solexa_quality_from_phred(0.1)
-16.32774717238372

Or, the other way round,

>>> def phred_quality_from_solexa(solexa_quality) :
...     return 10*log(10**(solexa_quality/10.0) + 1, 10)
...
>>> phred_quality_from_solexa(90)
90.000000004342922
>>> phred_quality_from_solexa(10)
10.41392685158225
>>> phred_quality_from_solexa(0)
3.0102999566398116
>>> phred_quality_from_solexa(-20)
0.043213737826425784

I think these python versions agree with the perl examples on
http://maq.sourceforge.net/qual.shtml (doing a base ten logarithm
seems much easier in python than in perl).

Combining this with the letter mapping using in the Solexa FASTQ
files, ord(letter)-64, we have:

>>> def phred_quality_from_solexa_fastq_letter(letter) :
...     return 10*log(10**((ord(letter)-64)/10.0) + 1, 10)

This seems to agree with the perl example on
http://maq.sourceforge.net/fastq.shtml (allowing for the missing
brackets which I've emailed them about).

So, in conclusion:

>>> phred_quality_from_fastq_letter("!")
0
>>> phred_quality_from_fastq_letter("{")
90
>>> solexa_quality_from_fastq_letter("!")
-31
>>> solexa_quality_from_fastq_letter("{")
59
>>> phred_quality_from_solexa_fastq_letter("!")
0.0034483543102526788
>>> phred_quality_from_solexa_fastq_letter("{")
59.000005467440147

Its very tricky to guess which FASTQ variant you have from the data
itself (but from the range of characters, some examples can only be
Solexa style).

If we know we have a standard FASTQ file we can trivially get the
PHRED scores.  If we have a Solexa encoded FASTQ file, we can
trivially get the Solexa scores.  With this log mapping we *could*
also do an implicit conversion of Solexa scores into PHRED scores, but
due to floating point issues this is a little lossy.  I would say
follow python conventions and go with making things explicit, and not
do this automatically when parsing.  We could do this automatically if
the user explicitly asks Bio.SeqIO to write out a "fastq-solexa"
format file and their SeqRecords don't have Solexa qualities but do
have PHRED qualities (or vice versa).

Peter