[emboss-dev] EMBOSS 6.3.0 released - SAM/BAM

Tue Aug 3 08:12:27 UTC 2010

On Tue, Aug 3, 2010 at 8:27 AM, Peter Rice <pmr at ebi.ac.uk> wrote:
>>
>> Just for the record, EMBOSS 6.3.1 does not append anything to the
>> read names, meaning paired end reads cannot be distinguished if
>> output as FASTA or FASTQ.
>>
>> I'm not sure my idea of appending /1 or /2 for paired reads is the
>> best solution (especially since there are other naming schemes
>> out there like _f and _r as suffixes). Nevertheless, it seems like a
>> practical solution. Would including a slash character within a
>> sequence name cause problems in EMBOSS (a potential issue
>> you raised earlier)?
>
> The /1 and /2 would cause horrible problems. The sequence names are
> used to generate default output file names so a '/' would have to be
> removed or converted, most likely to _1 and _2

Oh :(

I thought they might cause confusion with slashes in filenames, but
yes, they can't be used in filenames, can they?

> _f or _r as a suffix is much better ... but should we always assume these
> meanings? Should we add a command-line switch for paired read data?

My understanding is there are multiple different naming conventions,
so whatever we/you do, it won't please everyone. What would help here
is if the original read name were to be recorded in the SAM/BAM tags,
as I think was suggested last month or so on the samtools-devel mailing
list. However, that would come with a filesize penalty, and won't help
with old files.
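
To make the two naming options above concrete, here is a minimal Python
sketch. The helper names (paired_name, filename_safe) are hypothetical and
only for illustration; this is not how EMBOSS does or would implement it.

```python
def paired_name(name, mate, sep="/"):
    """Append a mate suffix to a read name, e.g. 'read7' -> 'read7/1'.

    Hypothetical helper: 'sep' selects the convention ('/' or '_').
    """
    if mate not in (1, 2):
        raise ValueError("mate must be 1 or 2")
    return f"{name}{sep}{mate}"


def filename_safe(name):
    """Convert '/' to '_' so the name can be used as a default file name."""
    return name.replace("/", "_")


if __name__ == "__main__":
    print(paired_name("read7", 1))      # slash convention
    print(filename_safe("read7/2"))     # converted for use in a file name
```

With this, a /1 or /2 suffix could be kept in the record name and only
converted to _1 or _2 at the point where a file name is generated.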

> Should we only do something for fastq, sam and bam (or other NGS
> formats?)

And FASTA too, since not all assemblers use quality scores. Also QUAL
files, if EMBOSS were to support them.

> It is a mystery to me how paired reads came to have the same name.
> When we first used them at EMBL for the Human HPRT locus we made
> sure to add an "r" suffix to the reverse reads.... but then, as we used
> the GCG assembly system, we were forced to have a unique name :-)

With Solexa/Illumina data, pairs got the same name bar a suffix.
Other sequencing centers have followed this pattern too, for example
Sanger sequencing with suffixes of .f and .r. I guess that in order to
clearly group paired reads, and save a little space, for SAM/BAM they
opted to store a single name and use the FLAG field to record whether
it is the forward or reverse read. Note that with strobed reads
and the like coming "soon", rather than just two reads in a pair, there
could be many child reads for a single fragment. Even with classic
Sanger sequencing of a PCR product you might end up with multiple
reads (e.g. two forward reads, one reverse), and whether and how to
handle this via an extension to SAM/BAM was also raised.

Some pipelines may even use the same name for a forward/reverse
pair, or ignore the names. Velvet for example just takes its paired
data as interleaved files (forward then reverse reads one after the
other).
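
As a rough illustration of the interleaved layout mentioned above, this
Python sketch alternates records from a forward and a reverse collection.
The function name is hypothetical, and it assumes both inputs are already
in matching order; it says nothing about how Velvet parses its input.

```python
def interleave(forward, reverse):
    """Yield records alternately: forward[0], reverse[0], forward[1], ...

    Assumes both iterables hold the same number of records in pair order.
    """
    for fwd, rev in zip(forward, reverse):
        yield fwd
        yield rev


if __name__ == "__main__":
    for record in interleave(["f1", "f2"], ["r1", "r2"]):
        print(record)
```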

>> Also, and this may be a bug, on output as unaligned SAM (and I
>> assume also for unaligned BAM), the fact that a read is paired and
>> the information about whether it is the first or second read is lost. The
>> FLAG is just set to 4, meaning unmapped. e.g.
>>
>> seqret -sformat bam -osformat sam ex1.bam -filter
>
> Hmmm ... this kind of thing is specific to SAM-BAM conversions, as other
> formats will lose it unless we find some way to preserve the detail.
>
> We will take a look at what we can keep between these formats (we do
> make similar efforts between EMBL and GenBank formats)

I think it would be useful to track the three bits for paired, read one, and
read two. From memory, all the other bits of the FLAG are only applicable
to mapped reads. Of course, this overlaps with the naming issue above.
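
For reference, a minimal Python sketch of what keeping those three bits
could look like. The bit values are the ones defined for the SAM FLAG
field; the function name is hypothetical, and the choice to drop every
other bit is just the assumption made above (mate-related bits such as
0x8 are ignored here).

```python
PAIRED = 0x1     # read is paired in sequencing
UNMAPPED = 0x4   # read itself is unmapped
READ1 = 0x40     # first read in the pair
READ2 = 0x80     # second read in the pair


def unaligned_flag(flag):
    """Keep only the pairing bits of 'flag' and mark the read unmapped."""
    return (flag & (PAIRED | READ1 | READ2)) | UNMAPPED


if __name__ == "__main__":
    for original in (0, 99, 147):
        print(original, "->", unaligned_flag(original))
```

An unpaired read still ends up with FLAG 4, while a paired read keeps
enough information to tell read one from read two.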

>>> Given the use case of going from (aligned) SAM/BAM back to the
>>> original FASTQ, for a round trip you *must* undo the reverse
>>> complementation. This is important even for single reads, as quality
>>> scores tend to trail off in the (original) read direction so some
>>> algorithms may treat a reverse version of the read differently.
>
> We will look into that one too.
>

Thanks.
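
To spell out the round trip point quoted above: reads stored as reverse
strand have FLAG bit 0x10 set, and their sequence and qualities are held
in reference orientation. A minimal Python sketch of undoing that follows;
the function name is hypothetical and only plain ACGTN letters are handled.

```python
REVERSE = 0x10  # SAM FLAG bit: SEQ is stored reverse complemented

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def original_orientation(seq, qual, flag):
    """Return (seq, qual) as originally sequenced.

    If the reverse bit is set, reverse complement the sequence and
    reverse the quality string; otherwise return both unchanged.
    """
    if flag & REVERSE:
        return seq.translate(_COMPLEMENT)[::-1], qual[::-1]
    return seq, qual


if __name__ == "__main__":
    print(original_orientation("AACG", "IIH#", 16))
    print(original_orientation("AACG", "IIH#", 0))
```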

> Many thanks for the suggestions
>

No problem.

Peter C.