[BioLib-dev] [emboss-dev] EMBOSS 6.3.0 released - SAM/BAM

Fri Jul 16 09:01:26 UTC 2010

EMBOSS has recently added SAM/BAM support. I am looking at adding
SAM/BAM support for the Bio* projects - BioRuby, BioPerl, BioPython
and BioJava.

There are three interesting implementations of BAM/SAM support: the
Picard library (Java), the Samtools API (C) and EMBOSS (C). A
description of the Sequence Alignment/Map format (SAM) can be found at

  http://samtools.sourceforge.net/SAM1.pdf

SAM is a textual format, and BAM is the matching binary format. From
the specification it is clear that BAM/SAM is a rather extensive
format, designed for large files, and parsing it would certainly
benefit from fast C code (over native Ruby/Perl/Python).

To start it would be handy to have a few use cases. If I look at the
Picard definition of a SAMRecord, see

  http://picard.sourceforge.net/javadoc/net/sf/samtools/SAMRecord.html

it just dumps the data into a structure, as with the CIGARString field.

That is very low level, but may be enough for BioLib. A SAM file
is most likely one unit. If there are more, we should read them in
through an iterator - these can be large files and you don't want
everything in memory all the time.

So, a first use-case would be to read a BAM/SAM file into raw records,
using an iterator. Right?
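
As a rough illustration of that use case, here is a minimal Python
sketch for the textual SAM side only. The names RawSamRecord and
iter_sam_records are made up for this example and are not part of
EMBOSS, Samtools, Picard or any Bio* project. It assumes tab-separated
lines with the 11 mandatory SAM columns and header lines starting with
'@'; all values are kept as raw strings. BAM would need a binary
decoder behind the same kind of iterator.

```python
import io
from collections import namedtuple

# Hypothetical raw record: the 11 mandatory SAM columns, kept as strings,
# plus any optional TAG:TYPE:VALUE fields left untouched in a list.
RawSamRecord = namedtuple(
    "RawSamRecord",
    "qname flag rname pos mapq cigar rnext pnext tlen seq qual tags",
)


def iter_sam_records(handle):
    """Yield one RawSamRecord per alignment line, reading lazily."""
    for line in handle:
        line = line.rstrip("\r\n")
        if not line or line.startswith("@"):
            continue  # skip blank lines and header lines
        cols = line.split("\t")
        if len(cols) < 11:
            raise ValueError(
                "expected at least 11 tab-separated columns, got %d" % len(cols)
            )
        yield RawSamRecord(*cols[:11], tags=cols[11:])


if __name__ == "__main__":
    # Tiny in-memory example; a real file handle works the same way.
    example = io.StringIO(
        "@HD\tVN:1.0\n"
        "read1\t0\tchr1\t100\t255\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0\n"
    )
    for rec in iter_sam_records(example):
        print(rec.qname, rec.pos, rec.cigar)
```

Because the function is a generator, only one record is held in memory
at a time, whatever the size of the file. Decoding fields such as the
CIGAR string is left to the caller, in line with the "raw records"
idea above.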

Pj.

On Thu, Jul 15, 2010 at 09:35:32AM -0500, Chris Fields wrote:
> Peter brought this up at BOSC during my talk, re: our alignment
> refactoring going on in relation to GSoC.  Currently, we have
> Lincoln's SAMTools and UCSC-related perl distributions on CPAN for
> our needs, but a common toolkit would be nice.

> Also (somewhat related, and another leftover from BOSC), Jim Procter
> raised the issue several times of how we intend to deal with large
> protein alignments or richly annotated alignment data, e.g.
> Pfam/Rfam and stockholm (his interests were related to Jalview, but
> it's a common concern).  We do this in-memory in BioPerl, in a
> semi-hackish way, but a common SAMTools-like way would be nice.