[emboss-dev] EMBOSS 6.3.0 released - SAM/BAM

Thu Jul 15 11:36:11 UTC 2010

On Thu, Jul 15, 2010 at 12:12 PM, Peter Rice <pmr at ebi.ac.uk> wrote:
>
>> What do you do about naming for paired reads? I was appending
>> /1 or /2 to match the Illumina convention. Doing nothing means
>> the paired reads will have the same names.
>
> Not addressed yet - let's look into a common approach though.
> We would also have to lok into what the '/' character does to EMBOSS's
> handling of sequence names.

My rational for appending the /1 and /2 is that in a typical workflow
you might take Illumina paired end data as FASTQ and map it onto
a genome with BWA giving SAM/BAM. You might then want to reverse
this (e.g. if given a SAM/BAM file by a collaborator, and you want to
try an alternative mapping tool or reference geneome, first you must
recover the raw reads again, e.g. as FASTQ files).

>> What do you do about the strand issue? SAM/BAM stored reads
>> which map onto the reverse strand in reverse complement. If
>> you want to get back to the original orientation for output as
>> FASTQ you must apply the reverse complement (plus reverse
>> the quality scores too of course).
>
> So far we read as sequences. Reading as mapped reads (very large
> alignments) is planned for the very near future so it can appear in the
> next release.

Given the use case of going from (aligned) SAM/BAM back to the
original FASTQ, for a round trip you *must* undo the reverse
complementation. This is important even for single reads, as quality
scores tend to trail off in the (original) read direction so some algorithms
may treat a reverse version of the read differently.

>> Do you support writing SAM/BAM files? If so, would this be
>> for aligned reads or unaligned reads only?
>
> Yes we do write them - so far unaligned but we will add aligned reads
> when we can treat that as an input type.

I was thinking about this for my experimental SAM/BAM support
for Biopython - doing unaligned output only is much more straight
forward for a stream based writer (no seeks) as you don't have to
worry about header information like reference sequences. Although
not as useful as writing aligned SAM/BAM, some people are already
using unaligned SAM/BAM for storing read data - e.g. GATK
http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit

>> Assuming you do write BAM files, do you support the recent
>> convention to use a single BGZF block [for the header], and
>> that where possible reads should not span a BGZF block boundary?
>
> We looked at samtools 1.7 to get things working. We still need to look at
> issues such as using the index for access to remote BAM files, and various
> flavours of blocks. I was not aware of the single block version. Again, we
> should compare notes.

Is here fine or on a more cross project list? I think BioPerl are sticking
to wrapping samtools rather than experimenting with a reimplementation.

I corrected above in line - samtools now uses a single BGZF block
for the BAM *header*. This was done to make rewriting the header
easier (you don't need to decompress and re-compress any reads
which happened to be with the header in the first block).

>> (I'm assuming some of the EMBOSS team must be on the
>> samtools-devel mailing list which is where most SAM/BAM
>> format discussion seems to take place)
>
> Actually no, but I will join it ASAP and catch up.

Excellent - there have been some interesting discussions about
BAM v2 (e.g. moving the header block, handling indels better)
and the possibility of using HDF5 underneath rather than the
in house gzip variant.

Peter C.