[BioLib-dev] R: [emboss-dev] EMBOSS 6.3.0 released - SAM/BAM

Pjotr Prins pjotr.public14 at thebird.nl
Fri Jul 16 14:36:40 UTC 2010


I am glad there is such positive interest in BAM/SAM support for
the Bio* languages.

On Fri, Jul 16, 2010 at 12:08:11PM +0100, Peter wrote:
> I haven't looked at the EMBOSS code, but I assume it is currently
> geared towards looking at the reads as individual sequences - much
> like FASTQ and most other file formats supported by EMBOSS (and much
> like my own experiments for Biopython).  This makes perfect sense
> for extending the existing EMBOSS tools to work with SAM/BAM. Is
> this a fair description Peter R?

I would also like to know what EMBOSS has to offer over SAMtools -
with regard to the API. I'll read the code soon, but sometimes that
doesn't explain everything ;)

> On the other hand, the samtools C API is very rich and allows a lot
> of alignment based operations (e.g. access to reads based on mapped
> position to a reference).  Isn't the samtools C API a broader more
> useful code base to wrap in the Bio* projects? It will also be kept
> up to date with the expected file format changes.
> 
> Perhaps Pjotr could clarify what he has in mind for BioLib?

What BioLib offers is the *plumbing* for SWIG bindings to the
different languages. This is perhaps less easy than you may think,
especially when you want to support many platforms. As it stands, we
have a tested environment which makes it relatively easy to provide
bindings.  EMBOSS, for one, is partially mapped. Do it for one
language, and you have it for all.

As an example, a single mapping of Staden IO Lib is defined here:

  http://github.com/pjotrp/biolib/blob/master/src/mappings/swig/staden_io_lib.i

This generates the bindings for Python:

  http://github.com/pjotrp/biolib/blob/master/src/mappings/swig/python/test/test_staden_io_lib.py

Perl:

  http://github.com/pjotrp/biolib/blob/master/src/mappings/swig/perl/test/test_staden_io_lib.pl

Ruby:

  http://github.com/pjotrp/biolib/blob/master/src/mappings/swig/ruby/test/test_staden_io_lib.rb

The three above are in effect integration tests for BioLib.

In addition, BioLib generates API documentation from the SWIG output.
It works, but I still need to make it attractive. The output can be
HTML, or a native format like POD, rdoc or Javadoc.

So what we can do with BioLib is create an API binding, write test
cases and generate documentation. The next step is to find a way to
easily deploy with existing Bio* libraries. That is where I need help
from the different Bio* projects. For BioRuby we are creating a
plugin system where you install BioRuby itself, followed by a command
line utility that allows you to download and hook in plugins,
including BioLib support.

My take is that it is less work for a Bio* project to support BioLib -
which only needs to be done once - than to write bindings for all
the different libraries on your own, let alone duplicate the code and
testing for reading/writing BAM/SAM.

We share the effort of supporting C bindings between the different
Bio* projects.

BioLib is not meant to add functionality to upstream libraries -
unless the upstream doesn't want to support some extension. For
example, for Affyio in BioLib I had to write a sensible interface. The
original version had a specialized R interface. The author had no
interest in supporting the extra interface, so that comes with BioLib.

With SAMTools the API looks decent, so we probably have to add
little in terms of functionality. EMBOSS would require a bit more,
unless everyone is happy with EMBOSS naming conventions (I think they
are less than perfect).

A use case list - as Jan has started - is needed, as it allows me to
start on something that is useful to someone now. A typical library
mapping can be done fairly quickly, unless the library leans heavily
on some persistent state and/or uses C pointers for returning values
through the API call. SWIG can be kinda tricky there.
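
To show the shape of that pointer problem, here is a minimal sketch.
It uses Python's ctypes rather than SWIG, and the C library's frexp()
rather than anything from SAMtools or EMBOSS, purely as a stand-in for
a function that hands a result back through a pointer argument. The
wrapper name and the library lookup are assumptions for illustration
(a Unix-like system is assumed).

```python
import ctypes
import ctypes.util

# Load the C math library. The lookup is platform dependent; this
# assumes a Unix-like system.
_libm = ctypes.CDLL(ctypes.util.find_library("m"))

# C prototype: double frexp(double x, int *exp);
# The mantissa is the return value, the exponent comes back via *exp.
_libm.frexp.argtypes = [ctypes.c_double, ctypes.POINTER(ctypes.c_int)]
_libm.frexp.restype = ctypes.c_double


def frexp(x):
    """Illustrative wrapper: return (mantissa, exponent) as a tuple."""
    exponent = ctypes.c_int(0)          # storage the C call writes into
    mantissa = _libm.frexp(x, ctypes.byref(exponent))
    return mantissa, exponent.value     # hide the pointer from the caller


print(frexp(8.0))
```

The binding layer has to allocate the storage, pass its address, and
turn the result into an ordinary return value. In SWIG this kind of
argument is normally handled with typemaps (for instance the OUTPUT
typemaps in typemaps.i), and that has to be spelled out per function.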

For me, I think I would agree with Jan to try SAMTools first. Though
there is no reason not to support both libraries.

To recap the advantages of BioLib: easy deployment, a shared
implementation for Python, Perl, Ruby and others, generated
documentation and a shared testing environment. Map once, run anywhere.

We will probably hit a few snags in this exercise. That is
unavoidable. But I hope the real gain will be for everyone here.

Sorry for the lengthy E-mail - it is not like me.

Pj.


