[Biopython-dev] Parsing TRANSFAC matrices with Bio.Motif

Michiel de Hoon mjldehoon at yahoo.com
Sun Sep 9 07:31:05 UTC 2012

Returning to a previous discussion...

> ..., currently Bio.Motif._Motif.Motif objects also perform
> functions that are more appropriate for a separate PWM
> (position-weight matrix) class within Bio.Motif. It may be
> a good idea to have a separate PWM class for this functionality.

> I'm not sure. I think it is valuable to be able to load
> instances from a file and then convert them to a PWM.
> It could be done with separate classes,
> but I'm not sure it would be easier then...

I think there is one confusing issue here.
The current .pwm() method of a Motif object doesn't calculate a position-weight matrix; it only normalizes the counts matrix to create a probability matrix. To calculate a PWM, we would have to take the logarithm of these probabilities divided by the corresponding background probabilities, which in Bio.Motif is currently done by the log_odds method.
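
To make the distinction concrete, here is a minimal pure-Python sketch of the two steps. It is not the Bio.Motif code: the function names normalize and log_odds, the A, C, G, T column order, the uniform background, and the base-2 logarithm are all assumptions for illustration, and it assumes every count is nonzero (otherwise pseudocounts would be needed before taking the logarithm).

```python
import math

def normalize(counts):
    """Divide each row of counts (one row per motif position) by its total."""
    return [[c / sum(row) for c in row] for row in counts]

def log_odds(probabilities, background):
    """Log (base 2, an assumption) of each probability over its background."""
    return [[math.log2(p / b) for p, b in zip(row, background)]
            for row in probabilities]

# Hypothetical counts; columns are assumed to be in the order A, C, G, T.
counts = [[1.0, 4.0, 3.0, 2.0],
          [7.0, 1.0, 1.0, 1.0]]
background = [0.25, 0.25, 0.25, 0.25]  # assumed uniform background

probabilities = normalize(counts)            # what .pwm() currently returns
pwm = log_odds(probabilities, background)    # an actual position-weight matrix
```

Each row of probabilities sums to 1, whereas the rows of pwm contain positive and negative scores relative to the background.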

So I was mainly thinking of a PWM class to represent what is currently being returned by the log_odds method. This allows users to create a PWM from the log-odds scores directly instead of from an alignment (for example, if the PWM is available from some publication but not the actual alignments).
This also avoids some confusion about which methods operate on which matrix. For example, currently motif.scanPWM and motif.score_hit operate on the log-odds matrix, whereas motif.anticonsensus, motif.consensus, and motif[:] use the probability matrix. In addition, motif.max_score and motif.min_score use the log-odds matrix to evaluate the scores of motif.consensus and motif.anticonsensus, which were calculated from the probability matrix (and therefore don't necessarily give the maximum and minimum score).
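
A small pure-Python example of how this can go wrong, assuming a non-uniform background. The matrix values, the helper names argmax_sequence and score, and the A, C, G, T column order are made up for illustration and are not part of Bio.Motif.

```python
import math

ALPHABET = "ACGT"  # assumed column order

# Hypothetical two-position probability matrix and A-rich background.
probabilities = [[0.4, 0.3, 0.2, 0.1],
                 [0.1, 0.1, 0.1, 0.7]]
background = [0.7, 0.1, 0.1, 0.1]

log_odds = [[math.log2(p / b) for p, b in zip(row, background)]
            for row in probabilities]

def argmax_sequence(matrix):
    """Pick the letter with the largest entry at each position."""
    return "".join(ALPHABET[row.index(max(row))] for row in matrix)

def score(matrix, sequence):
    """Sum the matrix entries selected by the letters of sequence."""
    return sum(row[ALPHABET.index(letter)]
               for row, letter in zip(matrix, sequence))

consensus = argmax_sequence(probabilities)  # "AT", from the probabilities
best = argmax_sequence(log_odds)            # "CT", from the log-odds scores

print(consensus, score(log_odds, consensus))  # score is about 2.0
print(best, score(log_odds, best))            # score is about 4.39
```

Here the consensus sequence scores lower than the best-scoring sequence, so evaluating the log-odds score of the consensus does not give the maximum score.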

So I would suggest keeping the various types of matrices explicit; something along these lines:

>>> motif = Motif.read(...)
>>> counts = motif.counts
# .counts is a property of motif
# counts is an instance of the Motif.FrequencyMatrix class
# you can also make a FrequencyMatrix object directly from
# the frequencies, as in
>>> counts = Motif.FrequencyMatrix(my_frequency_matrix)
>>> counts[2,:]
array([1.0, 4.0, 3.0, 2.0])
# indices refer explicitly to the counts matrix
>>> counts[2,'G']

>>> my_consensus_sequence = counts.consensus
# .consensus is a property of counts
>>> my_anticonsensus_sequence = counts.anticonsensus
# .anticonsensus is a property of counts

>>> my_probability_matrix = counts.normalize()
# this can be a numpy array, or a Motif.ProbabilityMatrix
# class that inherits from a numpy array
>>> my_probability_matrix[2,:]
array([0.1, 0.4, 0.3, 0.2])
# indices refer explicitly to the probability matrix

>>> pwm = counts.make_pwm(...)
# or pwm = Motif.PositionWeightMatrix(my_matrix)
>>> pwm[0,:]
array([ -2.3,  0.1,  1.2,  1.8])
>>> pwm[0,2]
>>> pwm[0,'C']
# indices explicitly refer to the pwm

>>> scores = pwm.scan(sequence)
>>> score = pwm.score(sequence)
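
To give an idea of how little such a class needs, here is a rough pure-Python sketch of a PositionWeightMatrix supporting the indexing, score and scan calls above. Everything in it is hypothetical: the constructor signature, the default "ACGT" alphabet, and the choice to have scan return one score per window are assumptions for illustration, not an existing Bio.Motif API.

```python
class PositionWeightMatrix:
    """Hypothetical sketch; not an existing Bio.Motif class."""

    def __init__(self, rows, alphabet="ACGT"):
        # rows[i][j] is the log-odds score of letter alphabet[j] at position i
        self.alphabet = alphabet
        self.rows = [list(row) for row in rows]

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, key):
        # Supports pwm[i, j], pwm[i, 'C'] and pwm[i, :]
        position, column = key
        if isinstance(column, str):
            column = self.alphabet.index(column)
        return self.rows[position][column]

    def score(self, sequence):
        """Score a sequence of exactly the same length as the matrix."""
        if len(sequence) != len(self):
            raise ValueError("sequence length must equal the matrix length")
        return sum(self[i, letter] for i, letter in enumerate(sequence))

    def scan(self, sequence):
        """Score every window of the matrix length along the sequence."""
        width = len(self)
        return [self.score(sequence[i:i + width])
                for i in range(len(sequence) - width + 1)]


# Made-up log-odds values for a motif of length 2.
pwm = PositionWeightMatrix([[-2.3, 0.1, 1.2, 1.8],
                            [1.0, -1.0, 0.0, 0.5]])
print(pwm[0, :])         # first row as a list
print(pwm[0, 'C'])       # same entry as pwm[0, 1]
print(pwm.scan("GATT"))  # one score per window: "GA", "AT", "TT"
```

Since all methods refer to a single matrix, there is no ambiguity about which matrix an index or a score refers to.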

Does that sound reasonable? Any comments, suggestions?

