<div dir="ltr"><div><div>Hello Daniel,<br><br></div>I am sorry, and this is embarrassing, but I thought I remembered the writers supporting implicit conversion, which, as you point out, is not the case.<br><br>The conversion needs to go to error probabilities and back because the quality score metrics are different; see the NAR paper linked below for details.<br><br></div><div>This pull request adds explicit conversion support and round-trip functional tests based on the test data described in the paper.<br><br><a href="https://github.com/biojava/biojava/pull/334">https://github.com/biojava/biojava/pull/334</a><br><br></div><div> michael<br><br></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Thu, Sep 17, 2015 at 6:45 PM, Daniel Katzel <span dir="ltr"><<a href="mailto:dkatzel@gmail.com" target="_blank">dkatzel@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div><div>Sorry, that was a typo not using the SangerFastqReader in the original post I made. 
I tried all the different readers just in case...<br><br></div>Using SangerReader still throws an exception if I use a non-sanger writer<br><br> FastqReader fastqReader = new SangerFastqReader();<br> FastqWriter fastqWriter = new IlluminaFastqWriter(); <br> <br> <br> <br> PrintStream out = ...<br></div> InputStream in = ...<br><div> fastqReader.stream(in,<br> new StreamListener() {<br> <br> @Override<br> public void fastq(Fastq fastq) {<br> <br> if (fastq.getSequence().length() > 20){<br> <br> try {<br> fastqWriter.append(out, fastq);<span class=""><br> } catch (IOException e) {<br> throw new UncheckedIOException(e);<br> }<br> }<br> }<br></span> });<br><br></div><div>Still throws an exception when trying to write the first read.<br><br></div><div>If I change the FastqWriter to a SangerWriter so the reader and writer are the same variant, it works as expected.<br><br></div><div>Stepping through the code, there is no code that actually performs any conversion in the Writer implementations or their parent class AbstractFastqWriter<br><br></div><div>It would be easy to add to the abstract writer, the code would be similar to what I posted above to make a new encoded quality string using the correct offset.<br></div></div><div class="HOEnZb"><div class="h5"><div class="gmail_extra"><br><div class="gmail_quote">On Thu, Sep 17, 2015 at 5:50 AM, Peter Cock <span dir="ltr"><<a href="mailto:p.j.a.cock@googlemail.com" target="_blank">p.j.a.cock@googlemail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div><div>On Thu, Sep 17, 2015 at 3:26 AM, Daniel Katzel <<a href="mailto:dkatzel@gmail.com" target="_blank">dkatzel@gmail.com</a>> wrote:<br>
><br>
> The fastq file I was using is part of the 1000genomes phase 3 dataset<br>
> (very large gzipped files) with about 25 million records each. The reads<br>
> are short so it is probably old.<br>
><br>
> Here's the file I used<br>
><br>
> <a href="ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00096/sequence_read/SRR062634_1.filt.fastq.gz" rel="noreferrer" target="_blank">ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00096/sequence_read/SRR062634_1.filt.fastq.gz</a><br>
><br>
> I made a histogram of the encoded quality values as ASCII codes:<br>
><br>
> 33 : 166838<br>
> 34 : 0<br>
> 35 : 100598505<br>
> 36 : 26817<br>
> 37 : 156873<br>
> 38 : 268700<br>
> 39 : 419677<br>
> 40 : 807326<br>
> 41 : 997720<br>
> 42 : 889665<br>
> 43 : 946268<br>
> 44 : 2372479<br>
> 45 : 4147316<br>
> 46 : 760108<br>
> 47 : 850433<br>
> 48 : 1433894<br>
> 49 : 1165379<br>
> 50 : 1769347<br>
> 51 : 2493316<br>
> 52 : 2966864<br>
> 53 : 12457233<br>
> 54 : 3172484<br>
> 55 : 3741809<br>
> 56 : 3722004<br>
> 57 : 4320581<br>
> 58 : 23804570<br>
> 59 : 6554713<br>
> 60 : 7207725<br>
> 61 : 33021639<br>
> 62 : 13106991<br>
> 63 : 60909837<br>
> 64 : 36753951<br>
> 65 : 70258165<br>
> 66 : 91889938<br>
> 67 : 102533947<br>
> 68 : 129093976<br>
> 69 : 368143099<br>
> 70 : 231023980<br>
> 71 : 1089945133<br>
><br>
><br>
> You can see the lowest value is 33, which means Sanger encoding.<br>
><br>
<br>
</div></div>Yes, this looks like the Sanger FASTQ encoding :)<br>
<br>
(Some data archives would convert from the legacy Solexa or Illumina<br>
1.3+ quality encodings into the standard Sanger FASTQ encoding).<br>
<br>
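The "lowest value is 33" reasoning can be sketched in a few lines of Java.<br>
This is only a heuristic under stated assumptions, not how BioJava decides:<br>
the class and method names are hypothetical, the cut-offs are the smallest<br>
printable codes of each variant (33 for Sanger, 59 for Solexa, 64 for<br>
Illumina 1.3+), and a file whose lowest code is 64 or above stays ambiguous.<br>
<br>
```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration; not part of the BioJava API. */
final class QualityEncodingGuess {

    /** Counts how often each ASCII code occurs across the given quality strings. */
    static long[] histogram(String... qualityLines) {
        long[] counts = new long[128];
        for (String line : qualityLines) {
            // US_ASCII encoding maps every character to a byte in 0..127.
            for (byte b : line.getBytes(StandardCharsets.US_ASCII)) {
                counts[b]++;
            }
        }
        return counts;
    }

    /** Guesses the FASTQ variant from the lowest ASCII code that occurs at all. */
    static String guess(long[] counts) {
        int lowest = -1;
        for (int i = 0; i < counts.length; i++) {
            if (counts[i] > 0) {
                lowest = i;
                break;
            }
        }
        if (lowest < 0) {
            return "unknown";        // no quality characters seen
        }
        if (lowest < 59) {
            return "fastq-sanger";   // too low for Solexa (59+) or Illumina 1.3+ (64+)
        }
        if (lowest < 64) {
            return "fastq-solexa";   // too low for Illumina 1.3+ (64+)
        }
        return "ambiguous";          // could be any of the three variants
    }
}
```
<br>
Fed the quality line of every record, histogram(...) gives a table like the<br>
one quoted above, and guess(...) reports fastq-sanger as soon as any code<br>
below 59 has been counted.<br>
<br>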
Because this is the Sanger FASTQ encoding, you should be using the<br>
SangerFastqReader. Your original email was using the<br>
IlluminaFastqReader which should have complained that there were ASCII<br>
characters under 64 present. That is presumably what happened given<br>
the message:<br>
<span><br>
<br>
Caused by: java.io.IOException: sequence SRR062634.1<br>
HWI-EAS110_103327062:6:1:1092:8469/1 not fastq-illumina format, was<br>
fastq-sanger<br>
at org.biojava.nbio.sequencing.io.fastq.IlluminaFastqWriter.validate(IlluminaFastqWriter.java:43)<br>
at org.biojava.nbio.sequencing.io.fastq.AbstractFastqWriter.append(AbstractFastqWriter.java:62)<br>
at org.biojava.nbio.sequencing.io.fastq.AbstractFastqWriter.append(AbstractFastqWriter.java:46)<br>
<br>
<br>
</span>Do you think this error message can be made clearer?<br>
<br>
We did come up with a whole set of functional tests including<br>
inter-conversion of the FASTQ encodings which are provided with the<br>
NAR paper as supplementary materials and used in the Bio* and EMBOSS<br>
test suites.<br>
<br>
<a href="http://dx.doi.org/10.1093/nar/gkp1137" rel="noreferrer" target="_blank">http://dx.doi.org/10.1093/nar/gkp1137</a><br>
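<br>
As a rough illustration of what inter-conversion involves, here is a minimal<br>
Java sketch. It assumes the usual score definitions (Phred: Q = -10 log10(p);<br>
Solexa: Q = -10 log10(p / (1 - p))), offset 33 for Sanger and offset 64 for<br>
Illumina 1.3+. The class and method names are hypothetical and this is not<br>
the BioJava implementation. Clamping Solexa scores at -5 and Illumina scores<br>
at 62 are assumptions about the valid ranges.<br>
<br>
```java
/** Hypothetical helper for illustration; not part of the BioJava API. */
final class QualityScoreConversion {

    /** Phred score to error probability, from Q = -10 * log10(p). */
    static double phredToErrorProbability(double phred) {
        return Math.pow(10.0, -phred / 10.0);
    }

    /** Solexa score to error probability, from Q = -10 * log10(p / (1 - p)). */
    static double solexaToErrorProbability(double solexa) {
        double odds = Math.pow(10.0, -solexa / 10.0); // p / (1 - p)
        return odds / (1.0 + odds);
    }

    /** Solexa score to the nearest integer Phred score, via the error probability. */
    static int solexaToPhred(int solexa) {
        double p = solexaToErrorProbability(solexa);
        return (int) Math.round(-10.0 * Math.log10(p));
    }

    /** Phred score to the nearest integer Solexa score, clamped at -5 (assumed minimum). */
    static int phredToSolexa(int phred) {
        double p = phredToErrorProbability(phred);
        // For phred == 0, p is 1.0 and the odds are infinite; the clamp handles that case.
        long rounded = Math.round(-10.0 * Math.log10(p / (1.0 - p)));
        return (int) Math.max(-5L, rounded);
    }

    /**
     * Sanger (offset 33) to Illumina 1.3+ (offset 64). Both hold Phred scores,
     * so only the offset changes; scores above 62 are clamped (assumed maximum).
     */
    static String sangerToIllumina(String sangerQuality) {
        StringBuilder out = new StringBuilder(sangerQuality.length());
        for (int i = 0; i < sangerQuality.length(); i++) {
            int phred = sangerQuality.charAt(i) - 33;
            out.append((char) (Math.min(phred, 62) + 64));
        }
        return out.toString();
    }
}
```
<br>
The two scales agree closely at high quality and diverge at the low end, so<br>
a round trip is only approximate for low scores; that is why the functional<br>
tests compare against reference files rather than assuming a fixed shift.<br>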
<span><font color="#888888"><br>
Peter<br>
</font></span></blockquote></div><br></div>
</div></div></blockquote></div><br></div>