[emboss-dev] Inconsistency in SAM vs BAM read description
biopython at maubp.freeserve.co.uk
Mon Aug 2 16:26:07 UTC 2010
After patching the following two issues,
there is a noticeable difference in the output from the SAM and BAM
parsers in the description of the reads:
$ seqret -sformat bam -osformat fasta ex1.bam -stdout -auto | head
$ seqret -sformat sam -osformat fasta ex1.sam -stdout -auto | head
As you can see from the above example (using files described in
the linked threads), when parsing SAM files if the read is mapped
then the reference sequence name is used as the description.
This seems like a sensible and useful thing to do. However, when
parsing BAM files this is not currently being done.
Having the SAM and BAM parser produce identical results is
very useful for testing purposes (e.g. running diff on their output
as FASTQ format), so I would like the BAM parser to do the same.
Looking at the source, function seqReadSam in ajax/core/ajseqread.c
does this with the reference name string:
ajStrTokenNextParseNoskip(&handle,&token); /* RNAME */
ajDebug("RNAME '%S'\n", token);
Therefore the BAM parser needs to do something similar, first
mapping the integer rID (reference sequence ID) to the array of
reference names from the BAM header.
I got as far as a partial solution but it only worked on the first read.
The problem is that although header variable ntargets is stored as
bamdata->Nref it does not appear that the array of strings
targetname is kept (likewise the array of integers targetlen but we
don't care about that here).
More information about the emboss-dev