[Biopython-dev] Calculating motif scores

Bartek Wilczynski bartek at rezolwenta.eu.org
Tue Jul 21 14:56:33 UTC 2009


Hi,

Sorry for the delayed response. Busy time...

On Fri, Jul 17, 2009 at 4:25 AM, Michiel de Hoon<mjldehoon at yahoo.com> wrote:
>
> It doesn't have to be so short. I've been running these calculations for whole mammalian chromosomes. For the human chromosome 1, this would take
> 247249719 * 4 bytes = 943 MB to store the scores in a Numerical Python array. This can still be comfortably handled by today's computers.

Well, I'm not sure that typical users would expect a single function
call to allocate that much memory, especially since most people would
only be interested in the "hits" that exceed some significance
threshold.

Nonetheless, there will be cases where the user is interested in all
scores for a sequence, even the negative ones. Then it is definitely
better to provide them with an array rather than a generator.
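
As a rough illustration of the trade-off (a sketch only, not the C code
mentioned in this thread and not an existing Bio.Motif API), the two
styles could look like this in plain Python. The names score_all and
iter_hits and the toy log-odds values are hypothetical. The "f" typecode
stores C floats, typically 4 bytes each, which mirrors the per-position
cost quoted above.

```python
from array import array

# Toy log-odds matrix for a motif of length 3 (made-up values, illustration only).
LOG_ODDS = [
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": 1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": -1.0, "G": 1.0, "T": -1.0},
]


def window_score(window, log_odds):
    """Sum the per-position log-odds for one window of the sequence."""
    return sum(column[base] for column, base in zip(log_odds, window))


def score_all(sequence, log_odds=LOG_ODDS):
    """Return every window score in a compact float array (memory grows with the sequence)."""
    width = len(log_odds)
    scores = array("f")
    for start in range(len(sequence) - width + 1):
        scores.append(window_score(sequence[start:start + width], log_odds))
    return scores


def iter_hits(sequence, threshold, log_odds=LOG_ODDS):
    """Lazily yield (position, score) only for windows scoring at least threshold."""
    width = len(log_odds)
    for start in range(len(sequence) - width + 1):
        score = window_score(sequence[start:start + width], log_odds)
        if score >= threshold:
            yield start, score


scores = score_all("TACGACG")
hits = list(iter_hits("TACGACG", threshold=3.0))
```

The array version keeps one value per position, so the caller can
inspect negative scores too; the generator version never holds more than
one score at a time, which suits the "hits only" use case.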
>
> I'll upload a C version to CVS so you guys can have a look and try it out.
>
I took a brief look. It seems fine to me. I haven't done any testing yet though.

I'll try to integrate it into a method of Bio.Motif. What do you think
about: Motif.scanPWM(self, sequence) ?

> How would you feel about having a separate PWM class in Bio.Motif? Some of the stuff currently in the class Motif is actually more about the PWM by itself; it may make sense to separate that out.

Hmm, I think that your question connects directly to a bigger design
question which came up earlier in the discussion of Bio.Motif
suggestions:
http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005811.html

I'm not sure myself whether I like having different classes for
different motif types: consensus, alignment, regexp, pwm and hmm. I
understand, though, that this makes things simpler for people who only
use one of those types, so that they don't have to deal with the
complications of a motif possibly coming from different sources and
behaving (slightly) differently.

I still think that it's useful to have a Motif class that can be used
in a similar way for different kinds of motifs. As for the PWM being a
separate class used by the motif: I don't know. I'm using
Bio.SubsMat.FreqTable to implement the frequency table, so I
understand that the new PWM class would basically be a "smarter"
FreqTable. I'm not sure whether it solves any problems...
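
To make the "smarter FreqTable" idea concrete, here is a minimal sketch
of what such a class might hold: per-position base frequencies plus a
log-odds view for scoring. Everything in it is an assumption for
illustration (the class name SimplePWM, the pseudocount, the uniform
background); it does not use FreqTable and is not a proposal for the
actual Bio.Motif interface.

```python
import math


class SimplePWM:
    """Hypothetical sketch: a per-position frequency table that can also score windows."""

    def __init__(self, instances, alphabet="ACGT", pseudocount=1.0):
        self.alphabet = alphabet
        self.length = len(instances[0])
        self.freqs = []
        for pos in range(self.length):
            # Start every base at the pseudocount so no frequency is zero.
            counts = {base: pseudocount for base in alphabet}
            for instance in instances:
                counts[instance[pos]] += 1
            total = sum(counts.values())
            self.freqs.append({base: n / total for base, n in counts.items()})

    def log_odds(self, background=None):
        """Return log2(frequency / background) per position; uniform background by default."""
        if background is None:
            background = {base: 1.0 / len(self.alphabet) for base in self.alphabet}
        return [
            {base: math.log2(freq / background[base]) for base, freq in column.items()}
            for column in self.freqs
        ]

    def score(self, window):
        """Sum the log-odds of one window that has the same length as the motif."""
        return sum(column[base] for column, base in zip(self.log_odds(), window))


pwm = SimplePWM(["ACG", "ACG", "ACT"])
```

A Motif class could then delegate its frequency bookkeeping and scoring
to an object like this, whatever source (alignment, consensus, ...) the
instances came from.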

cheers
  Bartek


