[emboss-dev] FASTQ support in Biopython, BioPerl, and EMBOSS

Peter biopython at maubp.freeserve.co.uk
Fri Jul 24 13:53:40 UTC 2009


On Fri, Jul 24, 2009 at 2:32 PM, Peter<biopython at maubp.freeserve.co.uk> wrote:
>>
>> Don't be surprised if there are still bugs lurking about, just let me
>> know and I'll fix 'em.
>
> I've got a bug report coming up in a second email, but the basics work :)

I think I have found a bug in BioPerl's conversion from fastq-solexa
to fastq-sanger concerning lower quality scores.

Here is an artificial Solexa file using the Solexa scores from 40 down
to -5 (which I believe to be the full range expected from an instrument).

$ more solexa_faked.fastq
@slxa_0001_1_0001_01
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
+slxa_0001_1_0001_01
hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@?>=<;

A Solexa quality of 40 maps to ASCII 40+64 = 104, "h"
A Solexa quality of -5 maps to ASCII -5+64 = 59, ";"
You should find this example has Solexa scores 40, 39, .., -4, -5.
This file is in the Biopython repository under biopython/Tests/Quality

Here is the conversion using MAQ (with the chomp fix from Tim Yu
to remove an extra "!" character, see the maq-help mailing list for
10 July 2009):

http://sourceforge.net/mailarchive/forum.php?thread_name=320fb6e00906170708lb2ce4f7qbc5dfa43543189a2%40mail.gmail.com&forum_name=maq-help

$ perl fq_all2std.pl sol2std < solexa_faked.fastq
@slxa_0001_1_0001_01
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
+
IHGFEDCBA@?>=<;:9876543210/.-,++*)('&&%%$$##""

Here is the Biopython conversion, which is identical:

$ python biopython_solexa2sanger.py < solexa_faked.fastq @slxa_0001_1_0001_01
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
+
IHGFEDCBA@?>=<;:9876543210/.-,++*)('&&%%$$##""

EMBOSS 6.1.0 has a rounding issue with negative Solexa
scores, and the last six qualities are up by one - Peter Rice
is aware of this, and has a fix:
http://lists.open-bio.org/pipermail/emboss-dev/2009-July/000596.html

$ seqret -filter -sformat fastq-solexa -osformat fastq-sanger <
solexa_faked.fastq
@slxa_0001_1_0001_01
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
+slxa_0001_1_0001_01
IHGFEDCBA@?>=<;:9876543210/.-,+*)(''&%%$$##"""

Now we come to BioPerl,

$ perl bioperl_solexa2sanger.pl < solexa_faked.fastq
@slxa_0001_1_0001_01
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
+slxa_0001_1_0001_01
IHGFEDCBA@?>=<;:9876543210/.-,+++*)(''&&&&%%%%

You look fine for the higher qualities, but there is something
really wrong for the lower scores (not just the negative ones).

I'll leave you to double check the details, but here are the
Sanger PHRED qualities decoded into integers (using Biopython
to convert from "fastq-sanger" to "qual" output):

$ perl bioperl_solexa2sanger.pl < solexa_faked.fastq | python
biopython_sanger2qual.py
>slxa_0001_1_0001_01
40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21
20 19 18 17 16 15 14 13 12 11 10 10 10 9 8 7 6 6 5 5 5 5 4
4 4 4

$ perl fq_all2std.pl sol2std < solexa_faked.fastq | python
biopython_sanger2qual.py
>slxa_0001_1_0001_01
40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21
20 19 18 17 16 15 14 13 12 11 10 10 9 8 7 6 5 5 4 4 3 3 2 2
1 1

Peter C.

P.S. This is the BioPerl script I am using here:

$ more bioperl_solexa2sanger.pl
use Bio::SeqIO;
my $in  = Bio::SeqIO->new(-fh => \*STDIN, -format => 'fastq-solexa');
my $out = Bio::SeqIO->new(-format => 'fastq-sanger');
while (my $seq = $in->next_seq) { $out->write_seq($seq) };



More information about the emboss-dev mailing list