[Biojava-l] converting fastq format

Thu Sep 17 09:50:51 UTC 2015

On Thu, Sep 17, 2015 at 3:26 AM, Daniel Katzel <dkatzel at gmail.com> wrote:
>
> The fastq file I was using is part of the 1000genomes phase 3 dataset
> (very large gzipped files) with about 25 million records each.  The reads
> are short so it is probably old.
>
> Here's the file I used
>
> ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00096/sequence_read/SRR062634_1.filt.fastq.gz
>
> I made a histogram of the encoded quality values as ascii:
>
>   33 :          166838
>   34 :               0
>   35 :       100598505
>   36 :           26817
>   37 :          156873
>   38 :          268700
>   39 :          419677
>   40 :          807326
>   41 :          997720
>   42 :          889665
>   43 :          946268
>   44 :         2372479
>   45 :         4147316
>   46 :          760108
>   47 :          850433
>   48 :         1433894
>   49 :         1165379
>   50 :         1769347
>   51 :         2493316
>   52 :         2966864
>   53 :        12457233
>   54 :         3172484
>   55 :         3741809
>   56 :         3722004
>   57 :         4320581
>   58 :        23804570
>   59 :         6554713
>   60 :         7207725
>   61 :        33021639
>   62 :        13106991
>   63 :        60909837
>   64 :        36753951
>   65 :        70258165
>   66 :        91889938
>   67 :       102533947
>   68 :       129093976
>   69 :       368143099
>   70 :       231023980
>   71 :      1089945133
>
>
> You can see the lowest value is 33 which means SANGER encoding.
>

Yes, this looks like the Sanger FASTQ encoding :)

(Some data archives convert from the legacy Solexa or Illumina
1.3+ quality encodings into the standard Sanger FASTQ encoding.)
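
As a rough illustration of the check Daniel did by hand, here is a
minimal sketch in plain Java that tallies the ASCII codes on the quality
lines and guesses the encoding from the lowest code seen. The class and
method names are made up for this example and are not part of BioJava.
The thresholds rely on the offsets of the three variants (Sanger starts
at ASCII 33, Solexa at 59, Illumina 1.3+ at 64). Anything below 59 can
only be Sanger, but a minimum of 59 or above is only a heuristic, since
a Sanger file with uniformly high scores would look the same.

```java
import java.util.Arrays;
import java.util.List;

final class QualityEncodingGuess {

    /** Count how often each ASCII code occurs across the quality lines. */
    static long[] histogram(Iterable<String> qualityLines) {
        long[] counts = new long[128];
        for (String line : qualityLines) {
            for (int i = 0; i < line.length(); i++) {
                char c = line.charAt(i);
                if (c < 128) {
                    counts[c]++;
                }
            }
        }
        return counts;
    }

    /** Guess the FASTQ variant from the lowest ASCII code that occurs. */
    static String guess(long[] counts) {
        for (int code = 0; code < counts.length; code++) {
            if (counts[code] > 0) {
                if (code < 33) {
                    return "invalid";
                }
                if (code < 59) {
                    return "fastq-sanger";   // below the Solexa minimum
                }
                if (code < 64) {
                    return "fastq-solexa";   // heuristic only
                }
                return "fastq-illumina";     // heuristic only
            }
        }
        return "unknown";                    // no quality characters seen
    }

    public static void main(String[] args) {
        // Two made-up quality lines; real input would come from the FASTQ file.
        List<String> sample = Arrays.asList("!##GGF", "GGGGE#");
        System.out.println(guess(histogram(sample)));
    }
}
```

With the histogram above, the lowest code is 33, so this sketch would
report fastq-sanger as well.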

Because this is the Sanger FASTQ encoding, you should be using the
SangerFastqReader. Your original email was using the
IlluminaFastqReader, which should have complained that there were ASCII
characters under 64 present. That is presumably what happened, given
the message:

Caused by: java.io.IOException: sequence SRR062634.1 HWI-EAS110_103327062:6:1:1092:8469/1 not fastq-illumina format, was fastq-sanger
        at org.biojava.nbio.sequencing.io.fastq.IlluminaFastqWriter.validate(IlluminaFastqWriter.java:43)
        at org.biojava.nbio.sequencing.io.fastq.AbstractFastqWriter.append(AbstractFastqWriter.java:62)
        at org.biojava.nbio.sequencing.io.fastq.AbstractFastqWriter.append(AbstractFastqWriter.java:46)

Do you think this error message can be made clearer?

We did come up with a whole set of functional tests, including
inter-conversion of the FASTQ encodings, which are provided with the
NAR paper as supplementary materials and used in the Bio* and EMBOSS
test suites.

http://dx.doi.org/10.1093/nar/gkp1137
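
To show what the Sanger to Illumina 1.3+ inter-conversion amounts to for
a single quality string, here is a minimal sketch in plain Java. It is
not the BioJava implementation, and the class and method names are
invented for this example. Both variants hold PHRED scores, so the
conversion is a change of ASCII offset (33 versus 64). The sketch caps
scores at 62, on the assumption that this is the top of the Illumina
1.3+ range, so very high Sanger scores are truncated rather than
preserved.

```java
final class QualityOffsetConversion {

    static final int SANGER_OFFSET = 33;
    static final int ILLUMINA_OFFSET = 64;
    static final int SANGER_MAX_PHRED = 93;    // ASCII 126 with offset 33
    static final int ILLUMINA_MAX_PHRED = 62;  // ASCII 126 with offset 64

    /** Re-encode a Sanger quality string with the Illumina 1.3+ offset. */
    static String sangerToIllumina(String sangerQuality) {
        StringBuilder out = new StringBuilder(sangerQuality.length());
        for (int i = 0; i < sangerQuality.length(); i++) {
            int phred = sangerQuality.charAt(i) - SANGER_OFFSET;
            if (phred < 0 || phred > SANGER_MAX_PHRED) {
                throw new IllegalArgumentException(
                        "not a Sanger quality character at index " + i);
            }
            // Scores above the Illumina maximum are truncated (lossy).
            int capped = Math.min(phred, ILLUMINA_MAX_PHRED);
            out.append((char) (capped + ILLUMINA_OFFSET));
        }
        return out.toString();
    }

    /** Re-encode an Illumina 1.3+ quality string with the Sanger offset. */
    static String illuminaToSanger(String illuminaQuality) {
        StringBuilder out = new StringBuilder(illuminaQuality.length());
        for (int i = 0; i < illuminaQuality.length(); i++) {
            int phred = illuminaQuality.charAt(i) - ILLUMINA_OFFSET;
            if (phred < 0 || phred > ILLUMINA_MAX_PHRED) {
                throw new IllegalArgumentException(
                        "not an Illumina 1.3+ quality character at index " + i);
            }
            out.append((char) (phred + SANGER_OFFSET));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // PHRED 0, 2 and 38 in Sanger encoding are '!', '#' and 'G'.
        System.out.println(sangerToIllumina("!#G")); // prints "@Bf"
    }
}
```

Conversions to or from the Solexa variant need more than an offset
change, because Solexa scores are on a different scale from PHRED
scores; the paper covers that mapping.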

Peter