[Biopython-dev] Converting between PHRED and Solexa quality scores (and FASTQ files)

Tue Feb 24 23:11:14 UTC 2009

Peter;
This is a great summary. I think these things belong on the wiki on
the documentation page once the functionality is rolled into
Biopython; it's a shame to see useful documentation hidden on the dev
mailing list.

Agreed 100% with no auto conversion. Providing the functionality to
convert is plenty, and I think it would be more confusing to start
seeing one type of scores when you expected another. Also, given the
size of these data sets we want to be as lightweight as possible.

Brad

> Hopefully this information will be of general interest - I could have
> just stuck it on the end of Bug 2767 but thought it more suited to the
> mailing list (or even a blog post?).
> http://bugzilla.open-bio.org/show_bug.cgi?id=2767
> 
> Nice links on mapping between Solexa and PHRED scores,
> http://maq.sourceforge.net/qual.shtml
> http://maq.sourceforge.net/fastq.shtml (missing some brackets in the
> final formula at the time of writing, I've emailed them)
> 
> and:
> http://illumina.ucr.edu/ht/documentation/file-formats
> http://rcdev.umassmed.edu/pipeline/Alignment%20Scoring%20Guide%20and%20FAQ.html
> (note they are missing a minus sign in the definition of Q_solexa)
> 
> For good quality reads the two scores are almost equal - but they
> differ for poor quality reads (PHRED scores go to zero, but Solexa
> scores can be negative).
> 
> A standard FASTQ file (as used by Sanger) encodes the quality
> information using PHRED scores, while Solexa/Illumina decided to use
> their own schema in the FASTQ variant.
> 
> In a PHRED style FASTQ file, PHRED quality = ord(letter) - 33
> In a Solexa style FASTQ file, Solexa quality = ord(letter) - 64
> 
> >>> def phred_quality_from_fastq_letter(letter) :
> ...     return ord(letter) - 33
> ...
> >>> def solexa_quality_from_fastq_letter(letter) :
> ...     return ord(letter) - 64
> ...
> 
> Both these scores are defined in terms of the estimated probability of
> an error (between 0 for a good read and 1 for a bad read).  A
> probability of almost zero gives a high quality score, while a
> probability of almost one gives a very low quality score.
> 
> >>> def phred_quality_from_error(error) :
> ...     return -10*log(error,10)
> ...
> >>> def solexa_quality_from_error(error) :
> ...     return -10*log(error/(1-error),10)
> ...
> >>> solexa_quality_from_error(0.000000001)
> 89.999999995657035
> >>> solexa_quality_from_error(0.999999999)
> -90.000000118483911
> >>> phred_quality_from_error(0.000000001)
> 89.999999999999986
> >>> phred_quality_from_error(0.999999999)
> 4.3429446983771231e-09
> >>> phred_quality_from_error(1)
> -0.0
> 
> Using these relationships you can map between PHRED and Solexa quality
> scores, assuming their error estimation methods are equivalent,
> 
> >>> def solexa_quality_from_phred(phred_quality) :
> ...     return 10*log(10**(phred_quality/10.0) - 1, 10)
> ...
> >>> solexa_quality_from_phred(90)
> 89.999999995657035
> >>> solexa_quality_from_phred(50)
> 49.99995657033466
> >>> solexa_quality_from_phred(10)
> 9.5424250943932485
> >>> solexa_quality_from_phred(1)
> -5.8682532438011537
> >>> solexa_quality_from_phred(0.1)
> -16.32774717238372
> 
> Or, the other way round,
> 
> >>> def phred_quality_from_solexa(solexa_quality) :
> ...     return 10*log(10**(solexa_quality/10.0) + 1, 10)
> ...
> >>> phred_quality_from_solexa(90)
> 90.000000004342922
> >>> phred_quality_from_solexa(10)
> 10.41392685158225
> >>> phred_quality_from_solexa(0)
> 3.0102999566398116
> >>> phred_quality_from_solexa(-20)
> 0.043213737826425784
> 
> I think these python versions agree with the perl examples on
> http://maq.sourceforge.net/qual.shtml (doing a base ten logarithm
> seems much easier in python than in perl).
> 
> Combining this with the letter mapping using in the Solexa FASTQ
> files, ord(letter)-64, we have:
> 
> >>> def phred_quality_from_solexa_fastq_letter(letter) :
> ...     return 10*log(10**((ord(letter)-64)/10.0) + 1, 10)
> 
> This seems to agree with the perl example on
> http://maq.sourceforge.net/fastq.shtml (allowing for the missing
> brackets which I've emailed them about).
> 
> So, in conclusion:
> 
> >>> phred_quality_from_fastq_letter("!")
> 0
> >>> phred_quality_from_fastq_letter("{")
> 90
> >>> solexa_quality_from_fastq_letter("!")
> -31
> >>> solexa_quality_from_fastq_letter("{")
> 59
> >>> phred_quality_from_solexa_fastq_letter("!")
> 0.0034483543102526788
> >>> phred_quality_from_solexa_fastq_letter("{")
> 59.000005467440147
> 
> Its very tricky to guess which FASTQ variant you have from the data
> itself (but from the range of characters, some examples can only be
> Solexa style).
> 
> If we know we have a standard FASTQ file we can trivially get the
> PHRED scores.  If we have a Solexa encoded FASTQ file, we can
> trivially get the Solexa scores.  With this log mapping we *could*
> also do an implicit conversion of Solexa scores into PHRED scores, but
> due to floating point issues this is a little lossy.  I would say
> follow python conventions and go with making things explicit, and not
> do this automatically when parsing.  We could do this automatically if
> the user explicitly asks Bio.SeqIO to write out a "fastq-solexa"
> format file and their SeqRecords don't have Solexa qualities but do
> have PHRED qualities (or vice versa).
> 
> Peter
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev