[emboss-dev] Bug report and patch - BAM quality score reading

Mon Aug 2 11:37:35 UTC 2010

Hi all,

Since I had several queries about what EMBOSS would do with SAM/BAM,
http://lists.open-bio.org/pipermail/emboss-dev/2010-July/000656.html
I decided to try it and see. I believe I have found a bug with reading quality
scores from BAM files in EMBOSS 6.3.1

$ seqret -version
EMBOSS:6.3.1

I have been using a small pair of SAM and BAM files, originally downloaded
as a SAM file with reference FASTA sequence from the pysam project, which
I converted to BAM using samtools as they specify in their readme file
http://code.google.com/p/pysam/source/browse/#hg/tests

e.g.

curl -O http://pysam.googlecode.com/hg/tests/ex1.fa
curl -O http://pysam.googlecode.com/hg/tests/ex1.sam.gz
gunzip ex1.sam.gz
samtools faidx ex1.fa
samtools import ex1.fa.fai ex1.sam ex1.bam

If we look at the first two reads in the SAM file, notice their quality strings:

$ head ex1.sam
EAS56_57:6:190:289:82	69	chr1	100	0	*	=	100	0	CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA	<<<7<<<;<<<<<<<<8;;<7;4<;<;;;;;94<;	MF:i:192
EAS56_57:6:190:289:82	137	chr1	100	73	35M	=	100	0	AGGGGTGCAGAGCCGAGTCACGGGGTTGCCAGCAC	<<<<<<;<<<<<<<<<<;<<;<<<<;8<6;9;;2;	MF:i:64	Aq:i:0	NM:i:0	UQ:i:0	H0:i:1	H1:i:0
...

Now let's ask EMBOSS seqret to convert from SAM to Sanger FASTQ,

$ seqret -sformat sam -osformat fastq-sanger ex1.sam -stdout -auto | head
@EAS56_57:6:190:289:82 chr1
CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA
+
<<<7<<<;<<<<<<<<8;;<7;4<;<;;;;;94<;
@EAS56_57:6:190:289:82 chr1
AGGGGTGCAGAGCCGAGTCACGGGGTTGCCAGCAC
+
<<<<<<;<<<<<<<<<<;<<;<<<<;8<6;9;;2;
@EAS51_64:3:190:727:308 chr1
GGTGCAGAGCCGAGTCACGGGGTTGCCAGCACAGG

The quality strings agree with the SAM file, good.

Now let's ask EMBOSS seqret to convert from BAM to Sanger FASTQ,

$ seqret -sformat bam -osformat fastq-sanger ex1.bam -stdout -auto | head
@EAS56_57:6:190:289:82
CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA
+
]]]X]]]\]]]]]]]]Y\\]X\U]\]\\\\\ZU]\
@EAS56_57:6:190:289:82
AGGGGTGCAGAGCCGAGTCACGGGGTTGCCAGCAC
+
]]]]]]\]]]]]]]]]]\]]\]]]]\Y]W\Z\\S\
@EAS51_64:3:190:727:308
GGTGCAGAGCCGAGTCACGGGGTTGCCAGCACAGG

The quality strings differ, this is bad.

In the SAM file these two reads have quality strings starting the "<",
ASCII 60 meaning PHRED 60-33 = 27.

In the funny BAM to Sanger FASTQ conversion, EMBOSS has used
"]" which is ASCII 93, giving PHRED 93-33 = 60. i.e. 33 more than it
should be. I suspected that the EMBOSS code for reading BAM files
was wrongly applying a 33 offset to the quality scores. In BAM files
the scores are simply encoded directly as uint8_t without any offset.

Looking at the source code, file ajax/core/ajseqread.c we have:

        for(i=0; i < (ajuint) c->l_qseq; i++)
        {
            ajFmtPrintAppS(&qualstr, " %02x", 33+d[dpos]);
            thys->Accuracy[i] = (float) (33 + d[dpos++]);
        }

The creation of a quality string appears to be for debug only,
and here adding 33 to make it scores printable ASCII using
the Sanger FASTQ encoding makes sense. However, adding
the offset to the accuracy looks like an oversight. How about:

        for(i=0; i < (ajuint) c->l_qseq; i++)
        {
            ajFmtPrintAppS(&qualstr, " %02x", 33+d[dpos]);
            thys->Accuracy[i] = (float) d[dpos++];
        }

With this tiny change, I get the expected Sanger FASTQ output
from a BAM file using seqret.

Regards,

Peter C.