[Biojava-l] converting fastq format
Daniel Katzel
dkatzel at gmail.com
Thu Sep 17 23:45:52 UTC 2015
Sorry, not using the SangerFastqReader in my original post was a typo.
I tried all the different readers just in case...
Using SangerFastqReader still throws an exception if I use a non-Sanger writer:
FastqReader fastqReader = new SangerFastqReader();
FastqWriter fastqWriter = new IlluminaFastqWriter();
PrintStream out = ...
InputStream in = ...
fastqReader.stream(in,
    new StreamListener() {
        @Override
        public void fastq(Fastq fastq) {
            if (fastq.getSequence().length() > 20) {
                try {
                    fastqWriter.append(out, fastq);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }
        }
    });
This still throws an exception when trying to write the first read.
If I change the FastqWriter to a SangerFastqWriter so the reader and writer
are the same variant, it works as expected.
Stepping through the code, nothing in the Writer implementations or their
parent class AbstractFastqWriter actually performs any conversion.
It would be easy to add this to the abstract writer; the code would be
similar to what I posted above, making a new encoded quality string using
the correct offset.
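
To illustrate the kind of re-encoding meant here, below is a minimal sketch that shifts a Sanger (Phred+33) quality string to the Illumina 1.3+ (Phred+64) offset. QualityRecoder and sangerToIllumina are hypothetical names and are not part of BioJava. Capping scores above 62 (the highest the Illumina variant can represent) is an assumption of this sketch, not something the library is known to do.

```java
/**
 * Hypothetical helper, NOT part of BioJava: re-encodes a FASTQ quality
 * string from the Sanger offset to the Illumina 1.3+ offset.
 */
final class QualityRecoder {
    static final int SANGER_OFFSET = 33;       // fastq-sanger: Phred+33
    static final int ILLUMINA_OFFSET = 64;     // fastq-illumina: Phred+64
    static final int MAX_SANGER_PHRED = 93;    // 33 + 93 = 126 ('~')
    static final int MAX_ILLUMINA_PHRED = 62;  // 64 + 62 = 126 ('~')

    private QualityRecoder() {}

    /** Returns the same Phred scores written with the Illumina offset. */
    static String sangerToIllumina(String sangerQuality) {
        StringBuilder recoded = new StringBuilder(sangerQuality.length());
        for (int i = 0; i < sangerQuality.length(); i++) {
            char c = sangerQuality.charAt(i);
            int phred = c - SANGER_OFFSET;
            if (phred < 0 || phred > MAX_SANGER_PHRED) {
                throw new IllegalArgumentException(
                        "not a Sanger quality character: " + (int) c);
            }
            // Assumption: scores too high for the Illumina variant are capped.
            int capped = Math.min(phred, MAX_ILLUMINA_PHRED);
            recoded.append((char) (capped + ILLUMINA_OFFSET));
        }
        return recoded.toString();
    }
}
```

For example, QualityRecoder.sangerToIllumina("!#I") gives "@Bh": each character moves up by 31 (64 - 33), so the Phred scores 0, 2 and 40 are unchanged.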
On Thu, Sep 17, 2015 at 5:50 AM, Peter Cock <p.j.a.cock at googlemail.com>
wrote:
> On Thu, Sep 17, 2015 at 3:26 AM, Daniel Katzel <dkatzel at gmail.com> wrote:
> >
> > The fastq file I was using is part of the 1000genomes phase 3 dataset
> > (very large gzipped files) with about 25 million records each. The reads
> > are short so it is probably old.
> >
> > Here's the file I used
> >
> >
> > ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00096/sequence_read/SRR062634_1.filt.fastq.gz
> >
> > I made a histogram of the encoded quality values as ascii:
> >
> > 33 : 166838
> > 34 : 0
> > 35 : 100598505
> > 36 : 26817
> > 37 : 156873
> > 38 : 268700
> > 39 : 419677
> > 40 : 807326
> > 41 : 997720
> > 42 : 889665
> > 43 : 946268
> > 44 : 2372479
> > 45 : 4147316
> > 46 : 760108
> > 47 : 850433
> > 48 : 1433894
> > 49 : 1165379
> > 50 : 1769347
> > 51 : 2493316
> > 52 : 2966864
> > 53 : 12457233
> > 54 : 3172484
> > 55 : 3741809
> > 56 : 3722004
> > 57 : 4320581
> > 58 : 23804570
> > 59 : 6554713
> > 60 : 7207725
> > 61 : 33021639
> > 62 : 13106991
> > 63 : 60909837
> > 64 : 36753951
> > 65 : 70258165
> > 66 : 91889938
> > 67 : 102533947
> > 68 : 129093976
> > 69 : 368143099
> > 70 : 231023980
> > 71 : 1089945133
> >
> >
> > You can see the lowest value is 33 which means SANGER encoding.
> >
>
> Yes, this looks like the Sanger FASTQ encoding :)
>
> (Some data archives would convert from the legacy Solexa or Illumina
> 1.3+ quality encodings into the standard Sanger FASTQ encoding).
>
> Because this is the Sanger FASTQ encoding, you should be using the
> SangerFastqReader. Your original email was using the
> IlluminaFastqReader which should have complained that there were ASCII
> characters under 64 present. That is presumably what happened given
> the message:
>
>
> Caused by: java.io.IOException: sequence SRR062634.1 HWI-EAS110_103327062:6:1:1092:8469/1 not fastq-illumina format, was fastq-sanger
>     at org.biojava.nbio.sequencing.io.fastq.IlluminaFastqWriter.validate(IlluminaFastqWriter.java:43)
>     at org.biojava.nbio.sequencing.io.fastq.AbstractFastqWriter.append(AbstractFastqWriter.java:62)
>     at org.biojava.nbio.sequencing.io.fastq.AbstractFastqWriter.append(AbstractFastqWriter.java:46)
>
>
> Do you think this error message can be made clearer?
>
> We did come up with a whole set of functional tests including
> inter-conversion of the FASTQ encodings which are provided with the
> NAR paper as supplementary materials and used in the Bio* and EMBOSS
> test suites.
>
> http://dx.doi.org/10.1093/nar/gkp1137
>
> Peter
>