[Biopython-dev] Bio.motifs.matrix.PositionSpecificScoringMatrix.calculate - scoring ambiguous sequences

Wed Jun 22 08:56:53 UTC 2016

Hi Sefa,

It looks like Michiel is extra busy at the moment, but in the absence
of his input, perhaps Bartek has some thoughts (he was the original
author of the motif module)?

I would suggest making the new mode optional, e.g. defaulting to
the current NaN results but making it easy to select your proposed
mean scoring. Might the minimum or maximum ever be useful?

If you (Sefa) want to go ahead, fork the repository, and explore this
on a branch leading to a potential pull request, that seems sensible.

Note we define the IUPAC ambiguity codes centrally in Python in
Bio/Data/IUPACData.py which you can import for the new Python
code. I can see arguments for and against having them hard coded
in the proposed new C code.
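
To make the options concrete, here is a rough pure-Python sketch of
per-position scoring with a selectable mode. It is only an illustration
and not the existing Biopython API: the names AMBIGUOUS_DNA, score_base,
calculate_ambiguous and the "mode" argument are all hypothetical, the
PSSM is assumed to be a plain list of dicts (one per column), and the
ambiguity table is hard coded here only to keep the example
self-contained (real code could take it from
Bio.Data.IUPACData.ambiguous_dna_values instead).

```python
import math

# Hypothetical local copy of the IUPAC DNA ambiguity codes, hard coded so
# the sketch runs without Biopython installed.
AMBIGUOUS_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}


def score_base(column, base, mode="nan"):
    """Score one base against one PSSM column (dict: letter -> score).

    mode="nan" mimics the current behaviour for ambiguous bases;
    "mean", "min" and "max" combine the scores of the possible letters.
    """
    letters = AMBIGUOUS_DNA[base.upper()]
    if len(letters) == 1:
        return column[letters]
    if mode == "nan":
        return math.nan
    scores = [column[letter] for letter in letters]
    if mode == "mean":
        return sum(scores) / len(scores)
    if mode == "min":
        return min(scores)
    if mode == "max":
        return max(scores)
    raise ValueError("unknown mode: %r" % mode)


def calculate_ambiguous(pssm, sequence, mode="nan"):
    """Return one score per window of len(pssm) along the sequence."""
    width = len(pssm)
    return [
        sum(score_base(col, base, mode)
            for col, base in zip(pssm, sequence[start:start + width]))
        for start in range(len(sequence) - width + 1)
    ]
```

With mode="nan" any window containing an ambiguous base still comes
out as NaN, so existing callers would see no change unless they opt in.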

Thanks,

Peter

On Wed, Jun 22, 2016 at 2:37 AM, Sefa Kilic <sefa1 at umbc.edu> wrote:

> Any thoughts?
>
> On Mon, Jun 13, 2016 at 10:58 AM, Peter Cock <p.j.a.cock at googlemail.com>
> wrote:
>
>> What do you think Michiel?
>>
>> Also related, earlier today I filed this issue:
>> https://github.com/biopython/biopython/issues/851
>>
>> Peter
>>
>> On Mon, Jun 13, 2016 at 3:26 PM, Sefa Kilic <sefa1 at umbc.edu> wrote:
>> > Hello all,
>> >
>> > I have been using the Bio.motifs PSSM search for a long time.
>> > Occasionally, I work with genome sequences containing ambiguous bases.
>> > Biopython currently does not support scoring sequences with ambiguous
>> > bases and I would like to propose a change to fix that.
>> >
>> > Currently, the "calculate" function in PositionSpecificScoringMatrix
>> > class checks if alphabets of both motif and sequence are
>> > IUPAC.IUPACUnambiguousDNA. If they are not, a ValueError exception is
>> > raised.
>> >
>> > The code itself, however, tolerates ambiguous bases on the sequence as
>> > NaN. That is, given a PSSM of length L, all L-mer subsequences of the
>> > given sequence are scored as NaN. I would like to extend it and do the
>> > scoring properly for ambiguous sequences. For instance, if the base is
>> > Y (C or T), it should be scored as the average of scoring it as C and
>> > as T. If the base is N, it should be scored as the average of all bases
>> > [S(A) + S(T) + S(C) + S(G)] / 4.
>> >
>> > The change needs to be done on both Python and C (_pwm.c) sides. What
>> > do you think? If you agree, I can implement it and send a pull request.
>> >
>> > Cheers,
>> >
>> > _______________________________________________
>> > Biopython-dev mailing list
>> > Biopython-dev at mailman.open-bio.org
>> > http://mailman.open-bio.org/mailman/listinfo/biopython-dev
>>