[Biopython-dev] Parsing TRANSFAC matrices with Bio.Motif
bartek at rezolwenta.eu.org
Mon Sep 10 07:12:59 UTC 2012
I think it is an idea worth discussing a little bit more. Thanks for
bringing it up, Michiel.
It captures at least some of the issues caused by the fact that
different motifs might be internally represented differently.
I'm not sure I'm all that excited about having to deal with explicit extra
classes for PWMs and aligned instances, but maybe this is the price
for having a clear separation of where certain things are calculated.
The issue I think still needs discussion is where the searching is
done. If I want to search for instances, do I do it from the PWM
object? This seems to be the natural idea, but then can we find a
nice interface for people who don't want to be bothered with too
I'll try to come up with a more thought through and longer response
later in the week...
On Sun, Sep 9, 2012 at 9:31 AM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
> Returning to a previous discussion...
>> ..., currently Bio.Motif._Motif.Motif objects also perform
>> functions that are more appropriate for a separate PWM
>> (position-weight matrix) class within Bio.Motif. It may be
>> a good idea to have a separate PWM class for this functionality.
>> I'm not sure. I think it is valuable to be able to load
>> instances from a file and then convert them to a PWM.
>> It could be done with separate classes,
>> but I'm not sure it would be easier then...
> I think there is one confusing issue here.
> The current .pwm() method of a Motif object doesn't calculate a position-weight matrix but only normalizes the counts matrix to create a probability matrix. To calculate a PWM, we would have to calculate the logarithm of these probabilities divided by the corresponding background probabilities (for which in Bio.Motif we are currently using the log_odds method).
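To make the distinction concrete, here is a minimal sketch of the two steps described above: normalizing counts into probabilities, then taking the logarithm of each probability divided by its background probability. The function names normalize and log_odds, the dict-per-column layout, the base-2 logarithm and the uniform background are all assumptions made for illustration; this is not the Bio.Motif implementation.

```python
import math

def normalize(counts):
    """Convert each column of counts into probabilities that sum to 1."""
    probabilities = []
    for column in counts:
        total = sum(column.values())
        probabilities.append({base: n / total for base, n in column.items()})
    return probabilities

def log_odds(probabilities, background):
    """Return log2(p / background) per base; a zero probability gives -inf."""
    matrix = []
    for column in probabilities:
        matrix.append({
            base: math.log2(p / background[base]) if p > 0 else float("-inf")
            for base, p in column.items()
        })
    return matrix

# One motif position, using the counts 1, 4, 3, 2 from the example further down.
counts = [{"A": 1.0, "C": 4.0, "G": 3.0, "T": 2.0}]
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}  # assumed uniform

probabilities = normalize(counts)   # what .pwm() currently returns
pwm = log_odds(probabilities, background)  # what a real PWM would hold
```

In practice a pseudocount would probably be added before normalizing so that zero counts do not produce infinite scores.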
> So I was mainly thinking of a PWM class to represent what is currently being returned by the log_odds method. This allows users to create a PWM from the log-odds scores directly instead of from an alignment (for example, if the PWM is available from some publication but not the actual alignments).
> Also this avoids some confusion with regard to which methods operate on which object. For example, currently motif.scanPWM and motif.score_hit operate on the log-odds matrix, while motif.anticonsensus, motif.consensus, and motif[:] use the probability matrix.
> In turn, motif.max_score and motif.min_score use the log-odds matrix to evaluate the score of motif.consensus and motif.anticonsensus, which were calculated using the probability matrix (and therefore don't necessarily return the maximum and minimum score).
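The last point is easy to reproduce with a small made-up example. The sketch below assumes a two-position motif and an AT-rich background (both invented for illustration) and shows that the consensus taken from the probability matrix need not be the sequence with the highest log-odds score.

```python
import math

# Hypothetical probability matrix, one dict per motif position.
probabilities = [
    {"A": 0.4, "C": 0.3, "G": 0.2, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
# Assumed non-uniform (AT-rich) background.
background = {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}

log_odds = [
    {base: math.log2(p / background[base]) for base, p in column.items()}
    for column in probabilities
]

def score(sequence, matrix):
    """Sum the per-position matrix entries for a sequence of motif length."""
    return sum(column[base] for column, base in zip(matrix, sequence))

# Consensus as defined on the probability matrix: most probable base per column.
consensus = "".join(max(col, key=col.get) for col in probabilities)
# Sequence that actually maximizes the log-odds score.
best = "".join(max(col, key=col.get) for col in log_odds)

consensus_score = score(consensus, log_odds)
best_score = score(best, log_odds)
```

Here the consensus is "AT" while the best-scoring sequence is "CT", because "A" is common in the background and so earns no log-odds credit at the first position. With a uniform background the two would coincide.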
> So I would suggest to keep the various types of matrices explicit; something along these lines:
> >>> motif = Motif.read(...)
> >>> counts = motif.counts
> # .counts is a property of motif
> # counts is an instance of the Motif.FrequencyMatrix class
> # you can also make a FrequencyMatrix object directly from
> # the frequencies, as in
> >>> counts = Motif.FrequencyMatrix(my_frequency_matrix)
> array([1.0, 4.0, 3.0, 2.0])
> # indices refer explicitly to the counts matrix
> >>> my_consensus_sequence = counts.consensus
> # .consensus is a property of counts
> >>> my_anticonsensus_sequence = counts.anticonsensus
> # .anticonsensus is a property of counts
> >>> my_probability_matrix = counts.normalize()
> # this can be a numpy array, or a Motif.ProbabilityMatrix
> # class that inherits from a numpy array
> array([0.1, 0.4, 0.3, 0.2])
> # indices refer explicitly to the probability matrix
> >>> pwm = counts.make_pwm(...)
> # or pwm = Motif.PositionWeightMatrix(my_matrix)
> array([ -2.3, 0.1, 1.2, 1.8])
> # indices explicitly refer to the pwm
> >>> scores = pwm.scan(sequence)
> >>> score = pwm.score(sequence)
> Does that sound reasonable? Any comments, suggestions?
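For discussion, here is a rough pure-Python sketch of how the separation proposed above could look, with the searching living on the PWM object. The class and method names (FrequencyMatrix, PositionWeightMatrix, consensus, normalize, make_pwm, score, scan) are taken from the proposal; the constructor arguments, the dict-per-column storage, the pseudocount and the base-2 logarithm are assumptions, and none of this is existing Bio.Motif code.

```python
import math

class PositionWeightMatrix:
    """Hypothetical log-odds matrix: one {letter: score} dict per position."""

    def __init__(self, columns):
        self.columns = list(columns)

    def __len__(self):
        return len(self.columns)

    def score(self, sequence):
        """Score a sequence whose length equals the motif length."""
        if len(sequence) != len(self):
            raise ValueError("sequence length must equal motif length")
        return sum(col[letter] for col, letter in zip(self.columns, sequence))

    def scan(self, sequence):
        """Yield (offset, score) for every motif-length window."""
        width = len(self)
        for start in range(len(sequence) - width + 1):
            yield start, self.score(sequence[start:start + width])


class FrequencyMatrix:
    """Hypothetical counts matrix: one {letter: count} dict per position."""

    def __init__(self, columns):
        self.columns = list(columns)

    @property
    def consensus(self):
        return "".join(max(col, key=col.get) for col in self.columns)

    def normalize(self, pseudocount=0.0):
        """Return per-position probabilities, optionally with a pseudocount."""
        result = []
        for col in self.columns:
            total = sum(col.values()) + pseudocount * len(col)
            result.append({k: (n + pseudocount) / total for k, n in col.items()})
        return result

    def make_pwm(self, background, pseudocount=1.0):
        """Build a log-odds matrix; the pseudocount avoids log(0)."""
        probabilities = self.normalize(pseudocount)
        return PositionWeightMatrix(
            {k: math.log2(p / background[k]) for k, p in col.items()}
            for col in probabilities
        )


counts = FrequencyMatrix([
    {"A": 8, "C": 1, "G": 0, "T": 1},
    {"A": 0, "C": 9, "G": 1, "T": 0},
])
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}  # assumed uniform
pwm = counts.make_pwm(background)
hits = list(pwm.scan("TTACGAC"))
```

People who don't want to deal with the intermediate objects could still be given a convenience method on the motif that builds the PWM with default settings and delegates to pwm.scan.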