[Biojava-l] converting fastq format

Daniel Katzel dkatzel at gmail.com
Thu Sep 17 02:26:10 UTC 2015


The FASTQ file I was using is part of the 1000 Genomes phase 3 dataset (very
large gzipped files, about 25 million records each).  The reads are short,
so the data is probably old.

Here's the file I used:

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00096/sequence_read/SRR062634_1.filt.fastq.gz

I made a histogram of the encoded quality characters by ASCII code
(code : count):

  33 :          166838
  34 :               0
  35 :       100598505
  36 :           26817
  37 :          156873
  38 :          268700
  39 :          419677
  40 :          807326
  41 :          997720
  42 :          889665
  43 :          946268
  44 :         2372479
  45 :         4147316
  46 :          760108
  47 :          850433
  48 :         1433894
  49 :         1165379
  50 :         1769347
  51 :         2493316
  52 :         2966864
  53 :        12457233
  54 :         3172484
  55 :         3741809
  56 :         3722004
  57 :         4320581
  58 :        23804570
  59 :         6554713
  60 :         7207725
  61 :        33021639
  62 :        13106991
  63 :        60909837
  64 :        36753951
  65 :        70258165
  66 :        91889938
  67 :       102533947
  68 :       129093976
  69 :       368143099
  70 :       231023980
  71 :      1089945133
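
A tally like the one above could be produced along these lines. This is a
minimal sketch in plain Java, not the code used for the numbers above. The
class name QualityHistogram is made up for illustration, and it assumes
every record is exactly four lines (no wrapped sequence or quality lines).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class QualityHistogram {

    /**
     * Tallies the ASCII codes found on the quality line of each record.
     * Assumption: four lines per record, quality on the fourth line.
     */
    public static long[] count(BufferedReader reader) {
        long[] counts = new long[128];
        try {
            String line;
            long lineNumber = 0;
            while ((line = reader.readLine()) != null) {
                if (lineNumber % 4 == 3) {
                    for (int i = 0; i < line.length(); i++) {
                        char c = line.charAt(i);
                        // ignore anything outside 7-bit ASCII
                        if (c < counts.length) {
                            counts[c]++;
                        }
                    }
                }
                lineNumber++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return counts;
    }

    public static void main(String[] args) {
        // two tiny made-up records, only to show the output layout
        String fastq = "@read1\nACGT\n+\n!#FG\n@read2\nACGT\n+\n##GG\n";
        long[] counts = count(new BufferedReader(new StringReader(fastq)));
        for (int code = 33; code < counts.length; code++) {
            if (counts[code] > 0) {
                System.out.printf("%4d : %15d%n", code, counts[code]);
            }
        }
    }
}
```

For a gzipped file, the reader would wrap a java.util.zip.GZIPInputStream
around the file stream instead of the StringReader used here.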


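You can see the lowest value is 33 ('!'), which means SANGER encoding: only
the Sanger variant uses an ASCII offset of 33, while the Solexa and Illumina
variants start at 59 and 64 respectively.  The highest value, 71, then
corresponds to a Phred score of 71 - 33 = 38.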
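
The reasoning from the lowest ASCII code to the encoding can be sketched as
below. This is only a rough heuristic under the usual offsets (Sanger 33,
Solexa from 59, Illumina 1.3+ from 64). The class VariantGuess and its
return strings are hypothetical and are not part of BioJava.

```java
public class VariantGuess {

    /**
     * Guesses the FASTQ variant from the lowest quality character seen.
     * A low minimum proves Sanger; a high minimum proves nothing, because
     * a Sanger file with only high scores looks the same.
     */
    public static String guess(int minAscii) {
        if (minAscii < 33 || minAscii > 126) {
            throw new IllegalArgumentException(
                "not a printable FASTQ quality character: " + minAscii);
        }
        if (minAscii < 59) {
            // below the Solexa minimum, so only offset 33 fits
            return "fastq-sanger";
        }
        if (minAscii < 64) {
            // below the Illumina 1.3+ minimum
            return "fastq-solexa or fastq-sanger";
        }
        return "ambiguous";
    }

    public static void main(String[] args) {
        // 33 is the lowest code in the histogram above
        System.out.println(guess(33));
    }
}
```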

I think the problem is that the FastqWriter code only allows a Fastq object
to be written if it has the same FastqVariant as the writer. I also didn't
see any unit tests in biojava that test converting between the formats.  In
fact there are several tests that make sure the Fastq being written has the
same FastqVariant as the type of the writer.

For example,
https://github.com/biojava/biojava/blob/master/biojava-sequencing/src/test/java/org/biojava/nbio/sequencing/io/fastq/IlluminaFastqWriterTest.java

has a test to make sure an IlluminaFastqWriter only writes Fastq objects
that are FastqVariant.FASTQ_ILLUMINA:

public void testValidateNotIlluminaVariant()
{
    IlluminaFastqWriter writer = new IlluminaFastqWriter();
    Appendable appendable = new StringBuilder();
    Fastq invalid = new FastqBuilder()
        .withDescription("description")
        .withSequence("sequence")
        .withQuality("quality_")
        .withVariant(FastqVariant.FASTQ_SANGER)
        .build();
    try
    {
        writer.append(appendable, invalid);
        fail("validate not fastq-illumina variant expected IOException");
    }
    catch (IOException e)
    {
        // expected
    }
}
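
Given that check, one possible workaround is to re-encode the quality string
to the writer's variant before building the Fastq. The sketch below does
that by hand in plain Java for the Sanger to Illumina direction (Phred score
= character - 33, Illumina character = score + 64). The class
SangerToIllumina is made up for illustration and is not a BioJava API.

```java
public class SangerToIllumina {

    private static final int SANGER_OFFSET = 33;
    private static final int ILLUMINA_OFFSET = 64;
    // highest score that still maps to a printable character at offset 64
    private static final int MAX_ILLUMINA_SCORE = 62;

    /** Re-encodes a Sanger (offset 33) quality string at offset 64. */
    public static String convert(String sangerQuality) {
        StringBuilder converted = new StringBuilder(sangerQuality.length());
        for (int i = 0; i < sangerQuality.length(); i++) {
            int score = sangerQuality.charAt(i) - SANGER_OFFSET;
            if (score < 0 || score > MAX_ILLUMINA_SCORE) {
                throw new IllegalArgumentException(
                    "score " + score + " at position " + i
                    + " cannot be written at offset 64");
            }
            converted.append((char) (score + ILLUMINA_OFFSET));
        }
        return converted.toString();
    }

    public static void main(String[] args) {
        // '!' (33), '#' (35) and 'G' (71) all occur in the histogram above
        System.out.println(convert("!#G"));
    }
}
```

The converted string could then go into withQuality(...) on a FastqBuilder
together with withVariant(FastqVariant.FASTQ_ILLUMINA), so that the
writer's variant check passes.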