[EMBOSS] Conservation of FASTQ scores by the EMBOSS tools.

Peter biopython at maubp.freeserve.co.uk
Thu Sep 17 09:32:21 UTC 2009


On Thu, Sep 17, 2009 at 10:18 AM, Peter Rice <pmr at ebi.ac.uk> wrote:
>
>> So again, could you reconsider making "fastq" act like "fastq-sanger"?
>> The Sanger FASTQ format allows ASCII 33 to 126 for the quality scores,
>> a superset of the Solexa/Illumina FASTQ variants - so even if you don't
>> know which kind of FASTQ file you have, and you don't care about the
>> qualities, parsing it as a Sanger FASTQ file will work.
>
> Yes, but it is dangerous if they could really be Solexa qualities.

Indeed, or an Illumina 1.3+ encoded FASTQ file.

So if the EMBOSS tools are used to read a FASTQ file without specifying
the FASTQ variant, do they currently detect that it is FASTQ, default to
the "fastq" setting, and ignore the quality information?

> What we could do is provide a utility that reads in fastq-sanger format and
> checks whether the quality scores make most sense as Sanger, Solexa or
> Illumina.

That could be useful - I guess you could scan all the reads building up
a histogram of the ASCII characters used. This could immediately
rule out some of the options, and then based on the distribution (if
you assume they are raw reads) you could make a good guess.
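
To make the idea concrete, here is a rough Python sketch of the first
step only (ruling out variants from the characters seen). It is not
anything EMBOSS provides. The function names are hypothetical, the
variant labels are just convenient names, and it assumes unwrapped
four-line FASTQ records. The ranges are ASCII 33-126 for Sanger, 59-126
for Solexa and 64-126 for Illumina 1.3+.

```python
from collections import Counter

# Valid ASCII ranges for the quality characters of each variant.
# The labels are illustrative names, not a claim about any tool's options.
RANGES = {
    "fastq-sanger": (33, 126),    # PHRED 0 to 93, offset 33
    "fastq-solexa": (59, 126),    # Solexa -5 to 62, offset 64
    "fastq-illumina": (64, 126),  # PHRED 0 to 62, offset 64
}


def quality_histogram(lines):
    """Count quality characters, assuming unwrapped 4-line FASTQ records."""
    counts = Counter()
    for index, line in enumerate(lines):
        if index % 4 == 3:  # the fourth line of each record holds the qualities
            counts.update(line.rstrip("\r\n"))
    return counts


def possible_variants(counts):
    """Return the variants whose valid range covers every character seen."""
    if not counts:
        return list(RANGES)
    lowest = min(ord(char) for char in counts)
    highest = max(ord(char) for char in counts)
    return [name for name, (low, high) in RANGES.items()
            if low <= lowest and highest <= high]
```

Passing an open file handle to quality_histogram() and the result to
possible_variants() gives the list of variants that are still possible.
If more than one remains, you would still have to guess from the shape
of the distribution, which this sketch does not attempt.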

> I consider reading as fastq-sanger by default to be rather dangerous.

That is understandable. How about removing the current "fastq" output
then? That might prevent some of the confusion at the moment. I'm
struggling to see any purpose for the current "fastq" output - can you
give me any example use case? Right now it has to pick an arbitrary
quality symbol, and it uses ASCII 34 (double quote), which means PHRED
1 (random) in a Sanger FASTQ file but is invalid in a Solexa or
Illumina 1.3+ FASTQ file.
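
As a quick check of that arithmetic, here is a small Python sketch (the
function and table names are hypothetical, not part of EMBOSS or any
other library). It assumes offset 33 with a minimum score of 0 for
Sanger, offset 64 with a minimum of -5 for Solexa, and offset 64 with a
minimum of 0 for Illumina 1.3+.

```python
# (ASCII offset, minimum valid score) for each variant; labels are illustrative.
ENCODINGS = {
    "fastq-sanger": (33, 0),
    "fastq-solexa": (64, -5),
    "fastq-illumina": (64, 0),
}


def decode_quality(char, variant):
    """Return the score encoded by char, or None if it is below the valid range."""
    offset, minimum = ENCODINGS[variant]
    score = ord(char) - offset
    return score if score >= minimum else None


def phred_error_probability(phred):
    """Convert a PHRED score to the probability that the base call is wrong."""
    return 10 ** (-phred / 10)
```

Here decode_quality('"', "fastq-sanger") gives 1, while the other two
variants give None because 34 - 64 = -30 is below their minimum scores.
phred_error_probability(1) is about 0.79, i.e. the base call is very
probably wrong.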

Regards,

Peter


