[Biopython] MOODS: fast search for position weight matrix matches in DNA sequences.

Peter biopython at maubp.freeserve.co.uk
Thu Sep 24 12:09:16 UTC 2009


On Thu, Sep 24, 2009 at 12:46 PM, Bartek Wilczynski
<bartek at rezolwenta.eu.org> wrote:
> On Thu, Sep 24, 2009 at 11:59 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>> Hi all,
>>
>> I'm forwarding an interesting post from Dave to the BioPerl mailing list, which
>> should also be of interest here...
>>
>
> Hi all,
>
> I've seen this paper. It is directly related to the Bio.Motif code.
> They did a pretty good job of implementing an extremely efficient tool
> for finding motif instances in DNA sequences. It's C++ and it beats my
> pure Python, brute-force code hands down... Of course this
> comes at the price of only being applicable to DNA (unambiguous
> alphabet only, etc.). Since they did the comparison, we have actually
> incorporated the _pwm.c module written by Michiel, which is also much
> faster and can be used for finding motifs in DNA.

I hadn't looked at the table until you pointed this out. I think they have
been negligent by not including the version numbers of the different
packages tested (and this is a general point, not just about Biopython).

> I have compared their performance with our code on a single Drosophila
> chromosome (20Mb). The results are similarly devastating to my old code:
> their code takes ~1.1 secs (advanced look-ahead algorithms in C++)
> while mine (pure Python) takes 350 secs. The code contributed recently
> by Michiel (simple algorithm, but in C) takes 2.3 secs to finish.

Our C code looks pretty good then :)
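
For anyone wondering what "brute-force" means here, a rough sketch of that
kind of scan is below. This is not the actual Bio.Motif or _pwm.c code; the
function name scan_pwm, the list-of-dicts matrix layout and the toy scores
are all assumptions made for illustration. Every window of the sequence is
scored against every column of the matrix, which is why a pure Python
version gets slow on a whole chromosome.

```python
def scan_pwm(sequence, log_odds, threshold):
    """Return (position, score) for each window scoring >= threshold.

    log_odds is a list with one dict per motif column, mapping each
    unambiguous DNA letter to its log-odds score (hypothetical layout).
    """
    width = len(log_odds)
    hits = []
    # Slide a window of the motif width along the sequence.
    for start in range(len(sequence) - width + 1):
        score = 0.0
        # Sum the per-column score of the letter seen at each offset.
        for offset in range(width):
            score += log_odds[offset][sequence[start + offset]]
        if score >= threshold:
            hits.append((start, score))
    return hits


# Toy three-column matrix that favours the word "ACG" (made-up scores).
toy_matrix = [
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": 1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": -1.0, "G": 1.0, "T": -1.0},
]
hits = scan_pwm("TTACGACG", toy_matrix, threshold=3.0)
print(hits)
```

The cost is proportional to sequence length times motif width, with a Python
dictionary lookup at every step. Compiled code avoids that per-step overhead
even when the algorithm itself stays the same.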

> Since they provide a Python interface (there is nothing Biopython
> related, despite their abstract), I was even thinking about
> incorporating their code into Biopython, but it's GPL. Instead, I can
> make the function using Michiel's code aware of the MOODS package,
> i.e. use it if it is installed.

I'm not sure about that from an architectural point of view, especially
if the two algorithms give different results or take different parameters.
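
One way to keep that kind of switch from happening silently would be to make
the backend an explicit choice. The sketch below is only an illustration:
choose_backend and the "builtin" label are hypothetical names, and the
module name "MOODS" is an assumption about how the package is imported. It
only detects whether the optional module is installed and does not call
into it.

```python
import importlib.util


def choose_backend(requested="auto", optional_module="MOODS"):
    """Return the name of the search backend to use.

    "builtin" stands for the code shipped with the library and is always
    available. The optional module is used only when asked for by name,
    or when requested is "auto" and the module can be found.
    """
    available = importlib.util.find_spec(optional_module) is not None
    if requested == "builtin":
        return "builtin"
    if requested == "auto":
        return optional_module if available else "builtin"
    if requested == optional_module:
        if not available:
            raise ImportError("%s is not installed" % optional_module)
        return optional_module
    raise ValueError("unknown backend: %r" % requested)


print(choose_backend("auto"))
```

With something like this the caller can always force "builtin" to get
reproducible results, whatever happens to be installed on the machine.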

> If we want to put it into the news, it would be worth mentioning that
> (thanks to Michiel) we have made quite some progress on that front.

Good idea - why don't you check in an extra paragraph to the NEWS
file section for Biopython 1.51 (or was it 1.52?). We can update
the news post too. In fact, if you wanted to you could write up a whole
blog post to put up on our news server with timing etc.

> As a side note, I feel a little bit guilty about making Biopython look
> slow compared to other tools. In the paper, they show a comparison
> between different tools (MOODS, BioPerl, Biopython) in terms of speed,
> which shows Biopython as by far the slowest. This is just because I
> was not writing this code with speed in mind (I work on short
> regulatory sequences...). Nonetheless, it can give the impression that
> Biopython is slow in general, which is not true. I will try to extend
> Michiel's code to accept different alphabets and then maybe phase out
> the slow code of mine.

Extending the C code to cover more cases sounds like a good idea.
However, I would keep the pure Python fallback for situations like
Jython where C extensions are not available.

Peter



