[Biopython-dev] Bio.motifs.matrix.PositionSpecificScoringMatrix.calculate - scoring ambiguous sequences

Mon Jun 13 14:26:13 UTC 2016

Hello all,

I have been using the Bio.motifs PSSM search for a long time. Occasionally,
I work with genome sequences containing ambiguous bases. Biopython
currently does not support scoring sequences with ambiguous bases and I
would like to propose a change to fix that.

Currently, the "calculate" function in the PositionSpecificScoringMatrix
class checks whether the alphabets of both the motif and the sequence are
IUPAC.IUPACUnambiguousDNA. If they are not, a ValueError exception is
raised.

The code itself, however, tolerates ambiguous bases in the sequence by
scoring them as NaN. That is, given a PSSM of length L, every L-mer
subsequence of the given sequence that contains an ambiguous base is scored
as NaN. I would like to extend it and do the scoring properly for ambiguous
sequences. For instance, if the base is Y (C or T), it should be scored as
the average of scoring it as C and as T, i.e. [S(C) + S(T)] / 2. If the base
is N, it should be scored as the average over all four bases,
[S(A) + S(T) + S(C) + S(G)] / 4.
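
To make the averaging rule above concrete, here is a minimal pure-Python
sketch. It does not use or mirror the actual Bio.motifs or _pwm.c code: the
names IUPAC_EXPANSION, score_window and calculate_ambiguous, and the
representation of the PSSM as a dict mapping each base to a list of
per-position scores, are assumptions made only for illustration.

```python
# Standard IUPAC nucleotide ambiguity codes, expanded to the bases they cover.
IUPAC_EXPANSION = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def score_window(pssm, window):
    """Score one L-mer; an ambiguous letter contributes the mean of its bases.

    `pssm` is a hypothetical stand-in: a dict mapping "A", "C", "G", "T"
    to lists of per-position scores, all of the same length L.
    """
    total = 0.0
    for position, letter in enumerate(window.upper()):
        bases = IUPAC_EXPANSION[letter]
        total += sum(pssm[base][position] for base in bases) / len(bases)
    return total


def calculate_ambiguous(pssm, sequence):
    """Return the score of every L-mer in `sequence`, left to right."""
    length = len(pssm["A"])
    return [
        score_window(pssm, sequence[start:start + length])
        for start in range(len(sequence) - length + 1)
    ]


# Toy PSSM of length 2 (made-up numbers, not from a real motif).
toy_pssm = {
    "A": [1.0, 0.0],
    "C": [2.0, -1.0],
    "G": [3.0, 0.5],
    "T": [4.0, 1.5],
}
scores = calculate_ambiguous(toy_pssm, "AYN")
```

With the toy matrix, the window "AY" scores S1(A) plus the mean of S2(C) and
S2(T), and the window "YN" scores the mean of S1(C) and S1(T) plus the mean
of all four position-2 scores. Unambiguous letters reduce to the usual
single-base lookup, so existing results would not change.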

The change would need to be made on both the Python and the C (_pwm.c)
sides. What do you think? If you agree, I can implement it and send a pull
request.

Cheers,