[BioLib-dev] [emboss-dev] EMBOSS 6.3.0 released - SAM/BAM

Jan Aerts jan.aerts at gmail.com
Fri Jul 16 09:26:47 UTC 2010


An initial list:
* convert BAM<->SAM
* extract specific regions from BAM/SAM
* convert BAM/SAM -> FASTQ
* extract singleton reads
* remove duplicates
* extract read pairs that are not mapped at the correct distance and/or
orientation
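
Several of these come down to testing bits of the SAM FLAG column. A
rough sketch in Python, assuming the aligner has filled in the flag
bits properly; the helper names are made up for illustration, and
"singleton" is taken here to mean a mapped read whose mate is unmapped:

```python
# FLAG bits as defined in the SAM specification.
PAIRED = 0x1          # read is part of a pair
PROPER_PAIR = 0x2     # aligner judged distance/orientation as correct
UNMAPPED = 0x4        # this read is unmapped
MATE_UNMAPPED = 0x8   # the mate is unmapped
DUPLICATE = 0x400     # marked as PCR or optical duplicate


def flag_of(sam_line):
    """Return the integer FLAG (second tab-separated column) of a SAM line."""
    return int(sam_line.split("\t")[1])


def is_singleton(flag):
    """Assumed definition: paired read that is mapped while its mate is not."""
    return bool(flag & PAIRED) and not flag & UNMAPPED and bool(flag & MATE_UNMAPPED)


def is_improper_pair(flag):
    """Both mates mapped, but the proper-pair bit is not set."""
    both_mapped = not flag & (UNMAPPED | MATE_UNMAPPED)
    return bool(flag & PAIRED) and both_mapped and not flag & PROPER_PAIR


def is_duplicate(flag):
    """Read was marked as a duplicate by an upstream tool."""
    return bool(flag & DUPLICATE)
```

Note that this only reads marks that are already there: actual duplicate
detection (comparing mapping positions) is a separate job.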

Further down the line it'd be nice to include functionality of GATK as well:
* Perform read clipping
* Local realignment in regions where there are a lot of SNPs (which might
actually indicate a single indel which would make those SNPs magically
disappear)
* base quality recalibration

Just to get you started :-)
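
For the iterator use case quoted below, something along these lines is
what I would picture for the text (SAM) side. This is only a sketch in
plain Python, not the API of any existing library: SamRecord,
iter_sam_records and the field names are my own labels for the eleven
mandatory columns, and the optional fields are left as raw strings.

```python
import io
from collections import namedtuple

# Hypothetical raw record: the eleven mandatory columns plus optional tags.
SamRecord = namedtuple(
    "SamRecord",
    "qname flag rname pos mapq cigar mate_rname mate_pos insert_size seq qual tags",
)


def iter_sam_records(handle):
    """Lazily yield one SamRecord per alignment line of a SAM text stream."""
    for line in handle:
        line = line.rstrip("\r\n")
        # Skip blank lines and header lines, which start with '@'.
        if not line or line.startswith("@"):
            continue
        f = line.split("\t")
        if len(f) < 11:
            raise ValueError("expected at least 11 columns, got %d" % len(f))
        yield SamRecord(
            f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5],
            f[6], int(f[7]), int(f[8]), f[9], f[10], f[11:],
        )


# Usage: any iterable of lines works, e.g. an open file or a StringIO.
example = io.StringIO(
    "@HD\tVN:1.0\n"
    "r1\t73\tchr1\t100\t30\t5M\t=\t100\t0\tACGTA\tIIIII\tNM:i:0\n"
)
for record in iter_sam_records(example):
    print(record.qname, record.flag, record.cigar)
```

Because it is a generator, only one record is held in memory at a time,
which is the point for large files. BAM would need a binary reader
behind the same kind of interface.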

jan.


On 16 July 2010 10:01, Pjotr Prins <pjotr.public14 at thebird.nl> wrote:

> EMBOSS has just recently added SAM/BAM support. I am looking at adding
> SAM/BAM support for the Bio* languages - BioRuby, BioPerl, BioPython
> and BioJava.
>
> There are three interesting implementations of BAM/SAM support: the
> Picard library (Java), the Samtools API (C) and EMBOSS (C). A description
> of the Sequence Alignment/Map format (SAM) can be found at
> http://samtools.sourceforge.net/SAM1.pdf. SAM is a textual
> format, and BAM is the matching binary format. From the specification
> it is clear that BAM/SAM is a rather extensive format, for large
> files, and would certainly benefit from fast C parsing (over native
> Ruby/Perl/Python).
>
> To start it would be handy to have a few use cases. If I look at the
> Picard definition of a SAMRecord, see
>
>  http://picard.sourceforge.net/javadoc/net/sf/samtools/SAMRecord.html
>
> it just dumps the data into a structure, as with the CIGARString field.
>
> That is very low level, but may be enough for BioLib. A SAM file
> is most likely one unit. If there are more we should read them in as
> an iterator - these can be large files and you don't want everything
> in memory all the time.
>
> So, a first use-case would be to read a BAM/SAM file into raw records,
> using an iterator. Right?
>
> Pj.
>
> On Thu, Jul 15, 2010 at 09:35:32AM -0500, Chris Fields wrote:
> > Peter brought this up at BOSC during my talk, re: our alignment
> > refactoring going on in relation to GSoC.  Currently, we have
> > Lincoln's SAMTools and UCSC-related perl distributions on CPAN for
> > our needs, but a common toolkit would be nice.
>
> > Also (somewhat related, and another leftover from BOSC), Jim Procter
> > raised the issue several times of how we intend to deal with large
> > protein alignments or richly annotated alignment data, e.g.
> > Pfam/Rfam and stockholm (his interests were related to Jalview, but
> > it's a common concern).  We do this in-memory in BioPerl, in a
> > semi-hackish way, but a common SAMTools-like way would be nice.


