[emboss-dev] Inconsistency in SAM vs BAM read description

Peter biopython at maubp.freeserve.co.uk
Mon Aug 2 16:42:07 UTC 2010


On Mon, Aug 2, 2010 at 5:26 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> Hi all,
>
> After patching the following two issues,
> http://lists.open-bio.org/pipermail/emboss-dev/2010-August/000667.html
> http://lists.open-bio.org/pipermail/emboss-dev/2010-August/000668.html
> there is a noticeable difference in the output from the SAM and BAM
> parsers in the description of the reads:
>
> $ seqret -sformat bam -osformat fasta ex1.bam -stdout -auto | head
>>EAS56_57:6:190:289:82
> CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA
>>EAS56_57:6:190:289:82
> AGGGGTGCAGAGCCGAGTCACGGGGTTGCCAGCAC
>>EAS51_64:3:190:727:308
> GGTGCAGAGCCGAGTCACGGGGTTGCCAGCACAGG
>>EAS112_34:7:141:80:875
> AGCCGAGTCACGGGGTTGCCAGCACAGGGGCTTAA
>>EAS219_FC30151:3:40:1128:1940
> CCGAGTCACGGGGTTGCCAGCACAGGGGCTTAACC
>
>
> $ seqret -sformat sam -osformat fasta ex1.sam -stdout -auto | head
>>EAS56_57:6:190:289:82 chr1
> CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA
>>EAS56_57:6:190:289:82 chr1
> AGGGGTGCAGAGCCGAGTCACGGGGTTGCCAGCAC
>>EAS51_64:3:190:727:308 chr1
> GGTGCAGAGCCGAGTCACGGGGTTGCCAGCACAGG
>>EAS112_34:7:141:80:875 chr1
> AGCCGAGTCACGGGGTTGCCAGCACAGGGGCTTAA
>>EAS219_FC30151:3:40:1128:1940 chr1
> CCGAGTCACGGGGTTGCCAGCACAGGGGCTTAACC
>
> As you can see from the above example (using files described in
> the linked threads), when parsing SAM files if the read is mapped
> then the reference sequence name is used as the description.
> This seems like a sensible and useful thing to do. However, when
> parsing BAM files this is not currently being done.
>
> Having the SAM and BAM parser produce identical results is
> very useful for testing purposes (e.g. running diff on their output
> as FASTQ format), so I would like the BAM parser to do the same.
>
> Looking at the source, function seqReadSam in ajax/core/ajseqread.c
> does this with the reference name string:
>
>    ajStrTokenNextParseNoskip(&handle,&token); /* RNAME */
>    ajDebug("RNAME '%S'\n", token);
>    if(ajStrGetLen(token))
>        seqAccSave(thys, token);
>

Just as a post script,

Having failed to enhance the BAM parser, for short term testing
I'm just commenting out the above two lines of the SAM parser.
With that trivial change, then the FASTA and FASTQ output
from both the SAM and BAM files agrees 100% (as you would
expect).

Peter C.




More information about the emboss-dev mailing list