[BioLib-dev] SAM/BAM use cases

Pjotr Prins pjotr.public14 at thebird.nl
Fri Jul 16 15:14:22 UTC 2010


Thanks for the use cases. It looks like some of this functionality is
not defined elsewhere. My proposal is to start with the following:

(1) Add SAMTools to BioLib

(2) Provide a binding for the basic read/write methods
    (see http://samtools.sourceforge.net/samtools/sam/index.html?Functions/Functions.html#//apple_ref/c/func/samread)

(3) Bind the SAM/BAM data structures
    (see http://samtools.sourceforge.net/samtools/bam/index.html?DataTypes/DataTypes.html#//apple_ref/c/tdef/bam1_t)

(4) Provide mappings for Python, Ruby and Perl

(5) Write integration tests

(6) Generate documentation

This covers the basic low-level SAMTools bindings. It ought to
allow you to access SAMTools methods and data structures from
Python/Ruby/Perl in a single environment.
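To give a feel for what the bound data structure in step (3) might look like
from the scripting side, here is a minimal pure-Python sketch. It is not the
BioLib binding and does not call SAMTools; the names SamRecord and
parse_sam_line are hypothetical, and the field names follow the eleven
mandatory columns of the current SAM text format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SamRecord:
    """One alignment line of a SAM text file (hypothetical illustration)."""
    qname: str   # query (template) name
    flag: int    # bitwise FLAG
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # CIGAR string
    rnext: str   # reference name of the mate
    pnext: int   # position of the mate
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # ASCII-encoded base qualities
    tags: List[str]  # optional TAG:TYPE:VALUE fields, left unparsed


def parse_sam_line(line: str) -> SamRecord:
    """Split one tab-separated SAM alignment line into a SamRecord."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("expected at least 11 tab-separated fields")
    return SamRecord(
        qname=fields[0],
        flag=int(fields[1]),
        rname=fields[2],
        pos=int(fields[3]),
        mapq=int(fields[4]),
        cigar=fields[5],
        rnext=fields[6],
        pnext=int(fields[7]),
        tlen=int(fields[8]),
        seq=fields[9],
        qual=fields[10],
        tags=fields[11:],
    )
```

A real binding would expose the C struct (bam1_t) rather than re-parse text,
so treat this only as a picture of the fields a script would want to reach.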

Even if this duplicates some of the existing modules out there, I
would like to show that it is possible as a cross-Bio* initiative.

Once it is in BioLib, I hope to show how easy it is to create new
(shared) functionality.

Pj.

On Fri, Jul 16, 2010 at 11:23:10AM +0100, Peter wrote:
> Regarding use cases for SAM/BAM, Jan Aerts wrote:
> 
> > An initial list:
> > * convert BAM<->SAM
> > * extract specific regions from BAM/SAM
> > * convert BAM/SAM -> FASTQ
> > * Extract singleton reads
> > * Duplicate removal
> > * Extract read pairs that are not mapped at the correct distance and/or
> > orientation
> 
> A few more use cases:
> 
> * Extract a single contig/reference and its mapped reads
>   (a special case of extract specific regions from BAM/SAM I guess)
> * Extract all the unmapped reads
> * Convert FASTQ -> unaligned SAM/BAM (useful for GATK)
> * Convert ACE (or other alignment formats) -> SAM/BAM
>   (I have a crude python script to do ACE to SAM)
> 
> Notice that some of these tasks are read-centric, and can probably be done
> by iterating over the reads one by one (no memory worries). Others are
> contig-centric, and this will require much more care.
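As an illustration of the read-centric case, "extract all the unmapped reads"
can be sketched as a generator over SAM text lines, so only one read is held
in memory at a time. This is a minimal sketch, not BioLib or SAMTools code:
the function name unmapped_reads is hypothetical, and it assumes plain SAM
text input where bit 0x4 of the FLAG column marks an unmapped read.

```python
FLAG_UNMAPPED = 0x4  # SAM FLAG bit: the read itself is unmapped


def unmapped_reads(sam_lines):
    """Yield the alignment lines whose FLAG has the 'unmapped' bit set.

    sam_lines can be any iterable of SAM text lines, for example an open
    file handle, so the input is streamed rather than loaded whole.
    """
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip header lines
        fields = line.split("\t")
        if int(fields[1]) & FLAG_UNMAPPED:
            yield line
```

Passing an open file handle streams the file line by line. A contig-centric
task has no such shortcut, since all reads for a reference must be collected
(or looked up through an index) before anything can be written out.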
> 
> > Further down the line it'd be nice to include functionality of GATK as well:
> > * Perform read clipping
> > * Local realignment in regions where there are a lot of SNPs (which might
> > actually indicate a single indel which would make those SNPs magically
> > disappear)
> > * base quality recalibration
> >
> > Just to get you started :-)
> >
> > jan.
> 
> With regard to converting (aligned) reads in SAM/BAM to FASTQ, there
> are issues to discuss. Firstly, the original distinct read pair names are not
> (currently) held, although there was a suggestion on the samtools-devel
> list to hold this in the tags. All you get is the template name, and in my
> code I was appending /1 or /2 in the Illumina style for the forward and
> reverse reads. Secondly, to recover the original read orientation, any
> read mapped to the reverse strand in SAM/BAM must be reverse
> complemented (and its quality string reversed). I mentioned these on the
> EMBOSS thread:
> http://lists.open-bio.org/pipermail/emboss-dev/2010-July/000656.html
> 
> Peter
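The two conversion issues Peter describes can be sketched in a few lines of
Python. This is a rough illustration under stated assumptions, not his script
or a SAMTools call: the function names are hypothetical, the input is a single
SAM text line, the /1 and /2 suffixes are chosen from the first-in-pair (0x40)
and second-in-pair (0x80) FLAG bits, and bit 0x10 marks a read stored on the
reverse strand. Records with SEQ set to "*" are not handled.

```python
FLAG_REVERSE = 0x10  # read is stored reverse-complemented
FLAG_READ1 = 0x40    # first read of the pair
FLAG_READ2 = 0x80    # second read of the pair

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def sam_line_to_fastq(line):
    """Convert one SAM alignment line to a four-line FASTQ record."""
    fields = line.rstrip("\n").split("\t")
    name, flag = fields[0], int(fields[1])
    seq, qual = fields[9], fields[10]

    # Restore the original read orientation.
    if flag & FLAG_REVERSE:
        seq = reverse_complement(seq)
        qual = qual[::-1]

    # SAM keeps only the template name, so add an Illumina-style suffix.
    if flag & FLAG_READ1:
        name += "/1"
    elif flag & FLAG_READ2:
        name += "/2"

    return "@%s\n%s\n+\n%s\n" % (name, seq, qual)
```

For example, a line with FLAG 83 (paired, first in pair, reverse strand) and
SEQ AACG comes out named with a /1 suffix and sequence CGTT, with the quality
string reversed to match.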