[BioLib-dev] SAM/BAM use cases

Peter biopython at maubp.freeserve.co.uk
Fri Jul 16 10:23:10 UTC 2010


Regarding use cases for SAM/BAM, Jan Aerts wrote:

> An initial list:
> * convert BAM<->SAM
> * extract specific regions from BAM/SAM
> * convert BAM/SAM -> FASTQ
> * Extract singleton reads
> * Duplicate removal
> * Extract readpairs that are not mapped at the correct distance and/or
> orientation

A few more use cases:

* Extract a single contig/reference and its mapped reads
  (a special case of extract specific regions from BAM/SAM I guess)
* Extract all the unmapped reads
* Convert FASTQ -> unaligned SAM/BAM (useful for GATK)
* Convert ACE (or other alignment formats) -> SAM/BAM
  (I have a crude Python script to do ACE to SAM)
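
To make the FASTQ -> unaligned SAM case concrete, here is a minimal sketch
for a single-end read. The function name is hypothetical. It assumes the
FASTQ qualities are already Sanger (Phred+33) encoded, which is what SAM
expects; Illumina 1.3+ style FASTQ would need converting first. It sets
FLAG to 4 (read unmapped) and fills the alignment columns with the SAM
placeholders for "not available".

```python
def fastq_to_unaligned_sam(name, seq, qual):
    """Return one SAM line (no newline) for an unaligned single-end read."""
    fields = [
        name,  # QNAME: read name, without the leading "@" from FASTQ
        "4",   # FLAG: 0x4 = read unmapped
        "*",   # RNAME: no reference sequence
        "0",   # POS: no position
        "0",   # MAPQ: no mapping quality
        "*",   # CIGAR: no alignment
        "*",   # RNEXT: no mate reference
        "0",   # PNEXT: no mate position
        "0",   # TLEN: no template length
        seq,   # SEQ: bases as read from the FASTQ record
        qual,  # QUAL: Phred+33 string, copied unchanged (assumption)
    ]
    return "\t".join(fields)


print(fastq_to_unaligned_sam("read1", "ACGT", "IIII"))
```

Paired reads would need the paired, mate-unmapped and first/second-in-pair
bits set in FLAG as well, which this sketch leaves out.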

Notice that some of these tasks are read-centric, and can probably be done
by iterating over the reads one by one (no memory worries). Others are
contig-centric, and these will require much more care.
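
As an illustration of the read-centric style, extracting all the unmapped
reads from a plain-text SAM file could look something like the sketch
below. The function name is hypothetical and this is not taken from any
existing library. It only ever holds one line in memory, and it relies on
bit 0x4 of the FLAG column meaning "read unmapped". BAM would need
decompressing and decoding first, which is not shown.

```python
import io


def iter_unmapped(sam_handle):
    """Yield SAM alignment lines whose FLAG has the 0x4 (unmapped) bit set."""
    for line in sam_handle:
        if line.startswith("@"):
            continue  # header line (@HD, @SQ, ...), not a read
        # FLAG is the second tab-separated column
        flag = int(line.split("\t", 2)[1])
        if flag & 0x4:
            yield line.rstrip("\n")


# Tiny made-up example: one mapped read and one unmapped read
example = io.StringIO(
    "@SQ\tSN:contig1\tLN:100\n"
    "r1\t0\tcontig1\t1\t30\t4M\t*\t0\t0\tACGT\tIIII\n"
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tTTTT\tIIII\n"
)
for record in iter_unmapped(example):
    print(record)
```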

> Further down the line it'd be nice to include functionality of GATK as well:
> * Perform read clipping
> * Local realignment in regions where there are a lot of SNPs (which might
> actually indicate a single indel which would make those SNPs magically
> disappear)
> * base quality recalibration
>
> Just to get you started :-)
>
> jan.

With regard to converting (aligned) reads in SAM/BAM to FASTQ, there
are two issues to discuss. Firstly, the original distinct read pair names are
not (currently) held, although there was a suggestion on the samtools-devel
list to hold these in the tags. All you get is the template name, and in my
code I was appending /1 or /2 in the Illumina style for the forward and
reverse reads. Secondly, to recover the original read orientation, any
read mapped to the reverse strand in SAM/BAM must be reverse
complemented (and its quality string reversed). I mentioned these on the
EMBOSS thread:
http://lists.open-bio.org/pipermail/emboss-dev/2010-July/000656.html
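
A rough sketch of those two points, working on fields already parsed from
one SAM line. This is not my actual script; the function name is
hypothetical. It assumes the /1 or /2 suffix can be chosen from FLAG bits
0x40 and 0x80 (first and second read of the pair), and that bit 0x10 marks
a read stored as its reverse complement. Only A, C, G, T and N are
complemented here; other IUPAC codes would need a fuller table.

```python
# Complement table for unambiguous bases plus N (upper and lower case)
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def sam_fields_to_fastq(qname, flag, seq, qual):
    """Return a FASTQ record string with the read in its original orientation."""
    if flag & 0x10:
        # Stored reverse complemented, so undo that and reverse the qualities
        seq = seq.translate(_COMPLEMENT)[::-1]
        qual = qual[::-1]
    if flag & 0x40:
        qname += "/1"  # first read of the pair
    elif flag & 0x80:
        qname += "/2"  # second read of the pair
    return "@%s\n%s\n+\n%s\n" % (qname, seq, qual)


# FLAG 83 = paired + properly paired + reverse strand + first in pair
print(sam_fields_to_fastq("frag1", 83, "AACG", "ABCD"), end="")
```

Unpaired reads are left with the bare template name.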

Peter


