[Biopython-dev] EMBOSS SAM/BAM parser and reverse strand reads

Peter biopython at maubp.freeserve.co.uk
Mon Aug 2 17:22:44 UTC 2010


Hi all,

One of my immediate questions on learning that EMBOSS 6.3.1 had
SAM/BAM support was how it handled reads mapped to the reverse
strand:

http://lists.open-bio.org/pipermail/emboss-dev/2010-July/000656.html
> What do you do about the strand issue? SAM/BAM stored reads
> which map onto the reverse strand in reverse complement. If
> you want to get back to the original orientation for output as
> FASTQ you must apply the reverse complement (plus reverse
> the quality scores too of course).

As I suspected, currently EMBOSS ignores this and gives the sequence
and quality string as it is stored in the SAM/BAM file.

Here are three consecutive entries from the example SAM file,
http://pysam.googlecode.com/hg/tests/ex1.sam.gz

...
EAS54_65:6:115:538:276	163	chr1	209	99	35M	=	360	186	TATTTGTAATGAAAACTATATTTATGCTATTCAGT	<<<<<<<<;<<<;;<<<;<:<<<:<<<<<<;;;7;	MF:i:18	Aq:i:75	NM:i:0	UQ:i:0	H0:i:1	H1:i:0
EAS219_FC30151:7:51:1429:1043	83	chr1	209	99	35M	=	59	-185	TATTTGTAATGAAAACTATATTTATGCTATTCAGT	9<5<<<<<<<<<<<<<9<<<9<<<<<<<<<<<<<<	MF:i:18	Aq:i:68	NM:i:0	UQ:i:0	H0:i:1	H1:i:0
EAS114_30:1:176:168:513	163	chr1	210	99	35M	=	410	235	ATTTGTAATGAAAACTATATTTATGCTATTCAGTT	<<<<;<<<<<<<<<<<<<<<<<<<:&<<<<:;0;;	MF:i:18	Aq:i:71	NM:i:0	UQ:i:0	H0:i:1	H1:i:0
...

The middle read of this triple, EAS219_FC30151:7:51:1429:1043, maps
to chr1 on the reverse strand - we known this from the flag value 83.

Note 83 = 1 + 2 + 16 + 64, or in hex, 0x53 = 0x40 + 0x10 + 0x02 + 0x01.

Referring to the SAM/BAM specification,
0x01 - read is paired
0x02 - read is in a proper pair
0x10 - mapped to reverse strand
0x40 - first read in the pair

This is the FASTQ output via seqret from SAM or BAM using EMBOSS 6.3.1
with the previously discussed patches:

@EAS54_65:6:115:538:276
TATTTGTAATGAAAACTATATTTATGCTATTCAGT
+
<<<<<<<<;<<<;;<<<;<:<<<:<<<<<<;;;7;
@EAS219_FC30151:7:51:1429:1043
TATTTGTAATGAAAACTATATTTATGCTATTCAGT
+
9<5<<<<<<<<<<<<<9<<<9<<<<<<<<<<<<<<
@EAS114_30:1:176:168:513
ATTTGTAATGAAAACTATATTTATGCTATTCAGTT
+
<<<<;<<<<<<<<<<<<<<<<<<<:&<<<<:;0;;

Notice that all three read sequence and quality strings match the SAM file.

On the other hand, this is from my experimental branch for Biopython,
converting SAM/BAM to FASTQ:

...
@EAS54_65:6:115:538:276/2
TATTTGTAATGAAAACTATATTTATGCTATTCAGT
+
<<<<<<<<;<<<;;<<<;<:<<<:<<<<<<;;;7;
@EAS219_FC30151:7:51:1429:1043/1
ACTGAATAGCATAAATATAGTTTTCATTACAAATA
+
<<<<<<<<<<<<<<9<<<9<<<<<<<<<<<<<5<9
@EAS114_30:1:176:168:513/2
ATTTGTAATGAAAACTATATTTATGCTATTCAGTT
+
<<<<;<<<<<<<<<<<<<<<<<<<:&<<<<:;0;;
...

Ignore for the moment the fact that I'm adding /1 and /2 suffixes to the read
names for the first and second (forward and reverse) reads in a pair.

Notice that for the second read (which is mapped to the reverse strand)
I am deliberately returning the reverse complement of the sequence, with
the quality string reversed.

I'd like to propose that EMBOSS also invert the sequence for those reads
mapped to the reverse strand. This is essential for the use case of converting
SAM/BAM to get the *original* unmapped reads. This applies regardless of
the output format: FASTA, FASTQ, or unaligned SAM/BAM (since for now
EMBOSS does not output aligned SAM/BAM).

Given I think there are problems with the SAM/BAM parsing in EMBOSS
6.3.1 which will require a patch or point release anyway, I don't think we
need to worry about this change breaking backwards compatibility (as
long as this is done as part of the first bug fix update). However, this
isn't my decision of course ;)

To elaborate, the reason I am acutely aware of this issue is that it has
bitten me already. I had some (large) SAM/BAM files from a collaborator
for paired end transcriptome data mapped onto a draft genome. Due to
the file sizes we didn't want to transfer the original FASTQ files over
the internet as well. When I wanted to remap the reads to a different
reference, I instead extracted the reads from the SAM/BAM files. Initially
I converted from SAM to FASTQ using sed (and also in Python as a
check) without being aware of the reverse stand issue...

There could be some valid reasons the current EMBOSS behaviour is
useful - but right now I can't think of any. Any suggestions?

Regards,

Peter C.



More information about the Biopython-dev mailing list