[Biopython-dev] [biopython] added unstable Bam parser class (e6343eb)

Peter Cock p.j.a.cock at googlemail.com
Tue Apr 2 09:32:22 UTC 2013


>> On Tue, Apr 2, 2013 at 9:52 AM, Kevin Murray <k.d.murray.91 at gmail.com>
>> wrote:
>> > Hi All,
>> >
>> > Peter and I have discussed
>> > <https://github.com/kdmurray91/biopython/commit/e6343ebae50e4ff0633476a5761b47aa5ecacec4#commitcomment-2905033>
>> > including the SamBam parser he has worked on into the master branch. I've
>> > offered to help with test coverage/missing features/testing.
>> >
>> > The performance is very good; sequentially reading all reads from an 11 MB
>> > (540k reads) Bam file took:
>> > CPython with Kevin's Pure-python parser: 0m17.531s real, 0m17.452s user
>> > CPython with Peter's Pure-python parser: 0m5.589s real, 0m5.560s user
>> > CPython with pysam: 4m29.240s real, 4m25.576s user
>> > Pypy1.9 with Kevin's Pure-python parser: 0m6.125s real, 0m6.056s user
>> > Pypy1.9 with Peter's Pure-python parser: 0m1.716s real, 0m1.624s user
>> >
>> > What are everyone's thoughts on including this into the master branch?
>> > (with a BiopythonExperimentalWarning)
>> >
>> > Regards,
>> > Kevin

> On 2 April 2013 19:59, Tiago Antão <tiagoantao at gmail.com> wrote:
>>
>> Regarding the performance comparison to pysam: wow! Fantastic!
>>

On Tue, Apr 2, 2013 at 10:09 AM, Kevin Murray <k.d.murray.91 at gmail.com> wrote:
> Hi Tiago,
> It is indeed impressive, which makes me suspect I've screwed something up in
> my benchmarks. I'll whack them up onto github for closer inspection sometime
> tomorrow (Aussie time).
>
> However, in general the code is:
>
> bam = BamParser("path")
> print next(bam)
> for mapping in bam:
>     pass
>
> Regards
> Kevin Murray

Those benchmark numbers are surprising - I suspect this is
not a fair comparison. The different parsers likely have very
different __str__ output for a BAM record (for mine this gives
a SAM format string, pysam does something close to SAM
but without the reference name).
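
To illustrate why the string conversion matters, here is a minimal
Python 3 sketch that times a pass over records with and without
calling str() on each one. FakeRecord and time_iteration are
hypothetical names made up for this example; they are not part of
Biopython or pysam, and a real comparison would substitute each
parser's own iterator.

```python
import time


def time_iteration(records, stringify=False):
    """Return (count, seconds) for one pass over an iterable of records."""
    start = time.perf_counter()
    count = 0
    for record in records:
        if stringify:
            # The cost of this call depends entirely on each parser's __str__
            str(record)
        count += 1
    return count, time.perf_counter() - start


class FakeRecord:
    """Hypothetical stand-in for a parser's record type (assumption)."""

    def __str__(self):
        fields = ["read", "0", "chr1", "1", "60", "4M", "*", "0", "0", "ACGT", "IIII"]
        return "\t".join(fields)


# Iterate only, without formatting anything
count, plain = time_iteration(FakeRecord() for _ in range(1000))
# Iterate and format each record as a string
count2, with_str = time_iteration((FakeRecord() for _ in range(1000)), stringify=True)
print(count, count2)
```

Reporting the two timings separately would show how much of any
difference between parsers comes from formatting rather than parsing.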

Something like BAM to SAM and then SAM to BAM would be
better for profiling the basic parsing and writing performance.
After that, random access, and maybe something where lazy
loading might have a chance to shine - perhaps counting the
number of reads mapped to the reverse strand (i.e. iterate
and look at the FLAG only).
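
As a rough sketch of the FLAG-only test, the following Python works on
SAM text lines rather than on any particular BAM parser, so it assumes
the BAM has already been converted to SAM. It uses the SAM flag bits
0x10 (read reverse strand) and 0x4 (read unmapped); the function name
and the sample records are invented for illustration.

```python
def count_reverse_strand(sam_lines):
    """Count mapped reads on the reverse strand, looking only at FLAG."""
    REVERSE = 0x10   # SAM flag bit: read reverse strand
    UNMAPPED = 0x4   # SAM flag bit: read unmapped
    count = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        # FLAG is the second tab-separated column; the rest is left unsplit
        flag = int(line.split("\t", 2)[1])
        if flag & REVERSE and not flag & UNMAPPED:
            count += 1
    return count


# Made-up sample records (not from the benchmark file)
sample = [
    "@HD\tVN:1.0\tSO:unsorted",
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t16\tchr1\t200\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t83\tchr1\t300\t60\t4M\t=\t250\t-54\tACGT\tIIII",
    "r4\t20\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r5\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]
print(count_reverse_strand(sample))  # prints 2 (r2 and r3)
```

A parser that decodes records lazily only needs to read the FLAG field
for this, which is why the task might favour it.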

Peter



