[Biopython-dev] Alignment object

Brad Chapman chapmanb at 50mail.com
Thu Mar 4 13:13:52 UTC 2010


Kevin and Peter;

> I was aware of pysam but am concerned about the dependencies:
> pyrex 0.9.8 or later, python 2.6 or later, plus of course SAMtools
> itself - which may all be fine on Linux, but will likely be trouble for
> us on other platforms (especially Windows).

I believe you can remove the pyrex requirement by shipping the
generated C file with the distribution. Samtools itself may be an
issue; however, right now it is probably a practical necessity for
dealing with SAM/BAM, since it implements much of the BAM generation,
sorting, merging and indexing you need in workflows. Also, the
samtools C code is included with the pysam distribution, so it is more
a matter of getting it compiled than of introducing extra
dependencies. The Bioconductor work appears to do the same thing.

> > I agree that we should work towards supporting SAM (and perhaps
> > also BAM) in Biopython, and other projects APIs can be very
> > useful for inspiration or guidance.

All of my work converts SAM directly into sorted and indexed BAM,
and then builds from that. For me, direct SAM parsing wouldn't be as
useful as BAM.
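
For reference, a minimal sketch of that SAM to sorted, indexed BAM step,
driven from Python. It assumes a samtools 1.x executable on the PATH
(older releases used a different "sort" syntax) and a SAM file that
carries its @SQ header lines. The function names are hypothetical and
not part of pysam or Biopython.

```python
import os
import subprocess


def sam_to_sorted_bam_commands(sam_path):
    """Return the output BAM path and the samtools commands to build it.

    Assumes samtools 1.x command-line syntax.
    """
    base, _ = os.path.splitext(sam_path)
    bam_path = base + ".sorted.bam"
    commands = [
        # Sort by coordinate and write BAM in one step.
        ["samtools", "sort", "-o", bam_path, sam_path],
        # Build the index next to the sorted BAM for random access.
        ["samtools", "index", bam_path],
    ]
    return bam_path, commands


def sam_to_sorted_bam(sam_path):
    """Run the commands; raises CalledProcessError if samtools fails."""
    bam_path, commands = sam_to_sorted_bam_commands(sam_path)
    for cmd in commands:
        subprocess.check_call(cmd)
    return bam_path
```

Keeping the command construction separate from execution makes it easy
to inspect or log what would be run before touching any files.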

> Honestly, the SAM/BAM format specification is pretty dodgy.  Thankfully
> between samtools and Picard source code, I've been able to work out most of
> the tricky bits.  I'm glad to know that the R folks are also working on
> this, since they're usually very good about generating clear documentation.

Agreed, but at least we are converging on something instead of
having to write a parser every time you use a new aligner. The
bioconductor SVN is here:

https://hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/Rsamtools/
(user: readonly, pass: readonly)

I think the pysam API does a decent job for reading and exposing
this. The higher level things that would be nice to add are:

- Converting the CIGAR string into something more useful.
- Smartly dealing with the X? fields from various aligners. These
  often contain very useful information missing from the SAM
  specification. Where the data actually is will be aligner
  specific.
- More generally, making the optional fields easier to work with.
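
As a rough illustration of the first and last points, here is a pure
Python sketch that turns a CIGAR string into a list of (operation,
length) pairs and converts the TAG:TYPE:VALUE optional fields into a
dictionary. The function names are hypothetical and this is not the
pysam API; only the "i" and "f" types are converted, and everything else
is left as a string.

```python
import re

# CIGAR operations that consume reference bases (per the SAM specification).
_CONSUMES_REF = set("MDN=X")
_CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Turn a CIGAR string such as '3M1I4M' into [(op, length), ...]."""
    if cigar == "*":  # '*' means the CIGAR is unavailable
        return []
    ops = [(op, int(n)) for n, op in _CIGAR_RE.findall(cigar)]
    # Rebuild the string to catch characters the regex silently skipped.
    if "".join("%d%s" % (n, op) for op, n in ops) != cigar:
        raise ValueError("Malformed CIGAR string: %r" % cigar)
    return ops


def reference_length(ops):
    """Number of reference bases spanned by the alignment."""
    return sum(n for op, n in ops if op in _CONSUMES_REF)


def parse_optional_fields(fields):
    """Convert ['NM:i:1', 'XA:Z:foo'] into {'NM': 1, 'XA': 'foo'}."""
    converters = {"i": int, "f": float}
    tags = {}
    for field in fields:
        # Split at most twice, since Z values may themselves contain ':'.
        tag, type_code, value = field.split(":", 2)
        tags[tag] = converters.get(type_code, str)(value)
    return tags


ops = parse_cigar("3S10M2D5M1I4M")
print(ops)
print(reference_length(ops))
print(parse_optional_fields(["NM:i:3", "X0:i:1", "MD:Z:10^AC9"]))
```

Interpreting the X? values would still need a per-aligner lookup on top
of this, since their meaning is not defined by the SAM specification.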

> Parsing SAM is pretty simple and I can certainly help with gluing it into
> Biopython (with some help on the Biopython side, since I'm still a newb).
> I'm about half-way to having a BAM reader and writer for my own purposes.
>  I'm coding the time-critical parts in Cython with a fallback to pure
> Python, so it may not be ideal for use in Biopython.

Cool. Does the BAM reader require samtools C code or is it
independent of that?

Brad


