[emboss-dev] EMBOSS 6.3.0 released - SAM/BAM

Mon Aug 2 17:41:25 UTC 2010

On Thu, Jul 15, 2010 at 12:36 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Thu, Jul 15, 2010 at 12:12 PM, Peter Rice <pmr at ebi.ac.uk> wrote:
>>
>>> What do you do about naming for paired reads? I was appending
>>> /1 or /2 to match the Illumina convention. Doing nothing means
>>> the paired reads will have the same names.
>>
>> Not addressed yet - let's look into a common approach though.
>> We would also have to look into what the '/' character does to EMBOSS's
>> handling of sequence names.
>
> My rationale for appending the /1 and /2 is that in a typical workflow
> you might take Illumina paired end data as FASTQ and map it onto
> a genome with BWA giving SAM/BAM. You might then want to reverse
> this (e.g. if given a SAM/BAM file by a collaborator, and you want to
> try an alternative mapping tool or reference genome, first you must
> recover the raw reads again, e.g. as FASTQ files).

Just for the record, EMBOSS 6.3.1 does not append anything to the
read names, meaning paired end reads cannot be distinguished if
output as FASTA or FASTQ.

I'm not sure my idea of appending /1 or /2 for paired reads is the
best solution (especially since there are other naming schemes
out there like _f and _r as suffixes). Nevertheless, it seems like a
practical solution. Would including a slash character within a
sequence name cause problems in EMBOSS (a potential issue
you raised earlier)?
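
To make the suggestion concrete, here is a minimal Python sketch of the
naming rule, assuming the SAM FLAG bits 0x1 (read is paired), 0x40 (first
in pair) and 0x80 (second in pair). The function name paired_read_name is
made up for illustration; this is not EMBOSS code.

```python
# SAM FLAG bits used for pairing (values from the SAM specification).
FLAG_PAIRED = 0x1   # read is part of a pair
FLAG_READ1 = 0x40   # first read in the pair
FLAG_READ2 = 0x80   # second read in the pair


def paired_read_name(qname, flag):
    """Return qname with /1 or /2 appended for paired reads.

    Hypothetical helper: unpaired reads keep their name unchanged.
    """
    if flag & FLAG_PAIRED:
        if flag & FLAG_READ1:
            return qname + "/1"
        if flag & FLAG_READ2:
            return qname + "/2"
    return qname


if __name__ == "__main__":
    print(paired_read_name("read_A", 99))   # paired, first in pair
    print(paired_read_name("read_A", 147))  # paired, second in pair
    print(paired_read_name("read_B", 4))    # unpaired, unmapped
```

A different suffix scheme such as _f and _r would only change the two
strings being appended.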

Also, and this may be a bug, on output as unaligned SAM (and I
assume also for unaligned BAM), the fact that a read is paired, and
whether it is the first or second read of the pair, is lost. The
FLAG is just set to 4, meaning unmapped. For example:

seqret -sformat bam -osformat sam ex1.bam -filter

or:

seqret -sformat sam -osformat sam ex1.sam -filter
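
For comparison, here is a rough Python sketch of how the pairing bits
could be carried over when writing unaligned SAM. It assumes the usual
SAM FLAG values (0x1 paired, 0x4 unmapped, 0x8 mate unmapped, 0x40 first
in pair, 0x80 second in pair); the function name unaligned_flag is
hypothetical and this is not what EMBOSS does internally.

```python
# SAM FLAG bits (values from the SAM specification).
FLAG_PAIRED = 0x1
FLAG_UNMAPPED = 0x4
FLAG_MATE_UNMAPPED = 0x8
FLAG_READ1 = 0x40
FLAG_READ2 = 0x80


def unaligned_flag(flag):
    """Return a FLAG for unaligned output that keeps the pairing bits.

    Hypothetical helper: all mapping-related bits are dropped, the read
    (and its mate, if any) is marked unmapped, and the paired/first/second
    information from the input FLAG is preserved.
    """
    if flag & FLAG_PAIRED:
        which_read = flag & (FLAG_READ1 | FLAG_READ2)
        return FLAG_PAIRED | FLAG_UNMAPPED | FLAG_MATE_UNMAPPED | which_read
    return FLAG_UNMAPPED


if __name__ == "__main__":
    for original in (99, 147, 16):
        print(original, "->", unaligned_flag(original))
```

With this rule an unpaired read still gets FLAG 4, while the two reads
of a pair get 77 and 141 rather than both collapsing to 4.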

>>> What do you do about the strand issue? SAM/BAM stores reads
>>> which map onto the reverse strand in reverse complement. If
>>> you want to get back to the original orientation for output as
>>> FASTQ you must apply the reverse complement (plus reverse
>>> the quality scores too of course).
>>
>> So far we read as sequences. Reading as mapped reads (very large
>> alignments) is planned for the very near future so it can appear in the
>> next release.
>
> Given the use case of going from (aligned) SAM/BAM back to the
> original FASTQ, for a round trip you *must* undo the reverse
> complementation. This is important even for single reads, as quality
> scores tend to trail off in the (original) read direction so some algorithms
> may treat a reverse version of the read differently.

To clarify, EMBOSS 6.3.1 does not flip reads mapped to the reverse strand:
http://lists.open-bio.org/pipermail/emboss-dev/2010-August/000667.html
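
As an illustration of the flip I mean, here is a minimal Python sketch,
assuming SAM FLAG bit 0x10 marks a read stored as its reverse complement
and that the quality string has one character per base. The function name
original_orientation is made up, and only A, C, G, T and N are
complemented (other IUPAC codes pass through unchanged).

```python
FLAG_REVERSE = 0x10  # SAM FLAG bit: SEQ is stored reverse complemented

# Complement table for unambiguous bases plus N; an assumption made to
# keep the sketch short, not a full IUPAC table.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def original_orientation(seq, qual, flag):
    """Return (seq, qual) in the orientation the read was sequenced in.

    Hypothetical helper: reads flagged as reverse strand are reverse
    complemented and their quality string is reversed to match.
    """
    if flag & FLAG_REVERSE:
        return seq.translate(_COMPLEMENT)[::-1], qual[::-1]
    return seq, qual


if __name__ == "__main__":
    print(original_orientation("AACG", "IIH#", 16))
    print(original_orientation("AACG", "IIH#", 0))
```

Reversing the qualities along with the bases keeps each score attached
to its base, so the trailing low-quality end is back at the 3' end of
the original read.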

Regards,

Peter C.