[EMBOSS] Conservation of FASTQ scores by the EMBOSS tools.

Thu Sep 17 10:06:16 UTC 2009

On Thu, Sep 17, 2009 at 10:52 AM, Peter Rice <pmr at ebi.ac.uk> wrote:
>
>>> What we could do is provide a utility that reads in fastq-sanger format
>>> and checks whether the quality scores make most sense as Sanger,
>>> Solexa or Illumina.
>>
>> That could be useful - I guess you could scan all the reads building up
>> a histogram of the ASCII characters used. This could immediately
>> rule out some of the options, and then based on the distribution (if
>> you assume they are raw reads) you could make a good guess.
>
> The ACD file would be 'interesting'. We could set the default format to be
> "fastq-sanger" and issue some warning if we find the user had tried to
> change it. That way the application would run with a filename as the input,
> though it will appear to interfaces to be able to read any sequence input.
>
> Are there rules we can use to decide on improbable qualities? Values below
> the Illumina and Solexa minima would seem a good guide, and perhaps
> values above the likely short read maximum score.
>
> Maybe some existing pipelines have some cutoff values we could adopt?

Quite possibly. Telling apart raw Sanger reads and raw Solexa/Illumina
reads should be easy. However, unless there are some ASCII characters
in the range 59 to 63 (Solexa -5 to -1), there isn't going to be a safe way
to tell Solexa and Illumina 1.3+ apart. Of course, if the file only has good
reads above Solexa/PHRED 10 (which would be ASCII 74), where the two scales
nearly coincide, it isn't going to make much difference either way. In any
case, it will be heuristic,
and sometimes it will get it wrong (e.g. post-processed Sanger FASTQ
files with high scores might look like raw reads in Solexa/Illumina
FASTQ).
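
To make the idea concrete, here is a rough Python sketch of the histogram
approach. It is only an illustration, not anything that exists in EMBOSS:
the function names are made up, it assumes simple four-line FASTQ records
(no line wrapping), and the returned labels are just strings. It uses the
ASCII ranges mentioned above (Sanger from 33, Solexa from 59, Illumina 1.3+
from 64) and makes no attempt at the distribution-based guessing.

```python
from collections import Counter


def quality_histogram(fastq_lines):
    """Count the ASCII codes used in the quality lines.

    Assumes four lines per record (header, sequence, '+', quality),
    i.e. no line-wrapped sequences or qualities.
    """
    counts = Counter()
    for index, line in enumerate(fastq_lines):
        if index % 4 == 3:  # fourth line of each record is the quality string
            counts.update(ord(char) for char in line.rstrip("\r\n"))
    return counts


def guess_fastq_variant(counts):
    """Heuristic guess based only on the lowest ASCII code seen."""
    if not counts:
        return "unknown"
    lowest = min(counts)
    if lowest < 33:
        return "invalid"  # below the Sanger minimum ('!' is ASCII 33)
    if lowest < 59:
        return "fastq-sanger"  # too low for Solexa (-5 is ASCII 59)
    if lowest < 64:
        return "fastq-solexa"  # ASCII 59 to 63 means Solexa -5 to -1
    # Nothing below ASCII 64: Solexa and Illumina 1.3+ cannot be told
    # apart, and it could even be high-scoring Sanger data.
    return "fastq-solexa or fastq-illumina"
```

For example, guess_fastq_variant(quality_histogram(open(filename))) on a
file of raw Sanger reads should return "fastq-sanger" as soon as any
quality character falls below ASCII 59.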

>>> I consider reading as fastq-sanger by default to be rather dangerous.
>>
>> That is understandable. How about removing the current "fastq" output
>> then? That might prevent some of the confusion at the moment. I'm
>> struggling to see any purpose for the current "fastq" output - can you
>> give me any example use case? Right now it has to pick an arbitrary
>> quality symbol, and uses ASCII 34 (double quote) which means PHRED
>> 1 (random) for a Sanger FASTQ file but is invalid as a Solexa or
>> Illumina 1.3+ FASTQ file.
>
> It is an alias for fastq-sanger which should be OK. I prefer to have an
> output format name for each input format name where it looks sensible,
> so if we read "fastq" as an input format it should do something on
> output. Unfortunately that means it has to write quality scores somehow.

I'm not convinced that the current "fastq" output (with the double quote
quality string) is entirely "sensible". But I'll drop this now - I've argued my
case, and will leave it at that. As long as the current behaviour is clear
in the documentation, it should be OK.
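
For anyone checking the arithmetic on the double quote character, a small
Python sketch follows. The table and the function name are made up for
illustration. It assumes the usual offsets (33 for Sanger, 64 for Solexa
and Illumina 1.3+) and minimum scores (0, -5 and 0 respectively).

```python
# (ASCII offset, minimum valid score) for each FASTQ variant.
VARIANTS = {
    "sanger": (33, 0),
    "solexa": (64, -5),
    "illumina 1.3+": (64, 0),
}


def decode_quality(char, variant):
    """Return the score a character encodes, or None if it is invalid."""
    offset, minimum = VARIANTS[variant]
    score = ord(char) - offset
    return score if score >= minimum else None


for name in VARIANTS:
    print(name, decode_quality('"', name))
```

This gives PHRED 1 for Sanger and None (invalid) for the other two, since
ASCII 34 is well below both ASCII 59 and ASCII 64.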

Regards,

Peter