[Biopython-dev] Bio.motifs.matrix.PositionSpecificScoringMatrix.calculate - scoring ambiguous sequences

Peter Cock p.j.a.cock at googlemail.com
Wed Jun 22 09:17:03 UTC 2016


Thanks Bartek - very useful comments!

I was also worried about changing the default - making this into
a separate method would be even more explicit than the optional
argument I was suggesting.
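
For concreteness, here is a minimal pure-Python sketch of what such a separate method could look like. Everything in it is an assumption for illustration: the name calculate_ambiguous is just the one floated in this thread, the PSSM is a plain dict mapping each base to a list of per-position scores, and the ambiguity table is written out locally so the snippet stands alone (in Biopython the codes are defined centrally in Bio/Data/IUPACData.py). It is not the existing Bio.motifs API and says nothing about the C side.

```python
import math

# Local copy of the IUPAC DNA ambiguity codes, spelled out here only so the
# sketch is self-contained (assumption: real code would import the central table).
AMBIGUOUS_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}

# User-selectable ways of combining the scores of the bases a letter stands for.
_COMBINE = {
    "mean": lambda values: sum(values) / len(values),
    "min": min,
    "max": max,
}


def calculate_ambiguous(pssm, sequence, mode="mean"):
    """Score every window of `sequence` against `pssm` (hypothetical helper).

    `pssm` maps each of A, C, G, T to a list of per-position scores.
    An ambiguous letter contributes the mean, min or max of the scores
    of the bases it can represent; an unknown letter gives NaN.
    """
    combine = _COMBINE[mode]
    length = len(pssm["A"])
    sequence = sequence.upper()
    scores = []
    for start in range(len(sequence) - length + 1):
        total = 0.0
        for offset in range(length):
            bases = AMBIGUOUS_DNA.get(sequence[start + offset])
            if bases is None:
                # Letter outside the table: the whole window is undefined.
                total = math.nan
                break
            total += combine([pssm[base][offset] for base in bases])
        scores.append(total)
    return scores


# Toy two-column matrix with made-up scores.
toy_pssm = {
    "A": [1.0, -1.0],
    "C": [0.0, 2.0],
    "G": [-1.0, 0.0],
    "T": [-2.0, 1.0],
}
mean_scores = calculate_ambiguous(toy_pssm, "AYG")
max_scores = calculate_ambiguous(toy_pssm, "AYG", mode="max")
```

Keeping this apart from calculate would leave the strict alphabet check and the fast path for unambiguous DNA untouched, and the mode argument covers the min and max variants as well as the mean.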

(Apologies, I had a typo in Michiel's email address.)

Peter

On Wed, Jun 22, 2016 at 10:08 AM, Bartek Wilczynski <barwil at gmail.com>
wrote:

> Dear Peter and Sefa,
>
> I was involved in Bio.Motif, but not so much in Bio.motifs anymore. Since
> you ask me personally, and I think this is an important issue, I can
> certainly give you comments, and you can make decisions accordingly.
>
> This is a big change in the sense that it can potentially:
> - give surprising results to some people (the semantics you suggest are by
> no means standard, and the usual interpretation of this score is log-odds,
> so taking the arithmetic mean of log-odds is questionable, though arguably
> natural to some users)
> - allow for some errors (using DNA motif on protein sequence, etc) that
> are currently prevented by strict checking
> - slow things down when people are actually searching for unambiguous DNA
> motifs
>
> If you think this functionality is important (I can see many places where
> it would come in handy), I'd suggest considering writing a new function,
> like "calculate_ambiguous" or something similar, that would be slower and
> have specific semantics (potentially user-selected - I could see people
> interested in min and max semantics in addition to the average proposed by Sefa).
>
> That's pretty much my 2 cents
> B
>
> On Wed, Jun 22, 2016 at 10:56 AM, Peter Cock <p.j.a.cock at googlemail.com>
> wrote:
>
>> Hi Sefa,
>>
>> It looks like Michiel is extra busy at the moment, but in the absence
>> of his input, perhaps Bartek has some thoughts (he was the original
>> author of the motif module)?
>>
>> I would suggest the new mode might need to be an option, e.g.
>> default to the current NaN results, but easy to pick your proposed
>> mean scoring. Might the minimum or maximum ever be useful?
>>
>> If you (Sefa) want to go ahead and fork the repository and explore
>> this on a branch leading to a potential pull request, that seems sensible.
>>
>> Note we define the IUPAC ambiguity codes centrally in Python in
>> Bio/Data/IUPACData.py which you can import for the new Python
>> code. I can see arguments for and against having them hard coded
>> in the proposed new C code.
>>
>> Thanks,
>>
>> Peter
>>
>> On Wed, Jun 22, 2016 at 2:37 AM, Sefa Kilic <sefa1 at umbc.edu> wrote:
>>
>>> Any thoughts?
>>>
>>> On Mon, Jun 13, 2016 at 10:58 AM, Peter Cock <p.j.a.cock at googlemail.com>
>>> wrote:
>>>
>>>> What do you think Michiel?
>>>>
>>>> Also related, earlier today I filed this issue:
>>>> https://github.com/biopython/biopython/issues/851
>>>>
>>>> Peter
>>>>
>>>> On Mon, Jun 13, 2016 at 3:26 PM, Sefa Kilic <sefa1 at umbc.edu> wrote:
>>>> > Hello all,
>>>> >
>>>> > I have been using the Bio.motifs PSSM search for a long time.
>>>> > Occasionally, I work with genome sequences containing ambiguous bases.
>>>> > Biopython currently does not support scoring sequences with ambiguous
>>>> > bases and I would like to propose a change to fix that.
>>>> >
>>>> > Currently, the "calculate" function in PositionSpecificScoringMatrix
>>>> > class checks if alphabets of both motif and sequence are
>>>> > IUPAC.IUPACUnambiguousDNA. If they are not, a ValueError exception is
>>>> > raised.
>>>> >
>>>> > The code itself, however, tolerates ambiguous bases on the sequence as
>>>> > NaN. That is, given a PSSM of length L, all L-mer subsequences of the
>>>> > given sequence are scored as NaN. I would like to extend it and do the
>>>> > scoring properly for ambiguous sequences. For instance, if the base is
>>>> > Y (C or T), it should be scored as the average of scoring it as C and
>>>> > as T. If the base is N, it should be scored as the average of all
>>>> > bases [S(A) + S(T) + S(C) + S(G)] / 4.
>>>> >
>>>> > The change needs to be done on both Python and C (_pwm.c) sides. What
>>>> > do you think? If you agree, I can implement it and send a pull request.
>>>> >
>>>> > Cheers,
>>>> >
>>>> > _______________________________________________
>>>> > Biopython-dev mailing list
>>>> > Biopython-dev at mailman.open-bio.org
>>>> > http://mailman.open-bio.org/mailman/listinfo/biopython-dev
>>>>
>>>
>>>
>>>
>>
>>
>
>
> --
> Bartek Wilczynski
> ==================
> Institute of Informatics
> University of Warsaw
> http://www.mimuw.edu.pl/~bartek
>
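
On the caveat above that an arithmetic mean of log-odds is questionable, a small numeric illustration may help. The column probabilities and the uniform background below are made up for the example (assumptions, not taken from any real motif). The snippet compares averaging the log-odds of C and T for a Y with taking the log-odds of "C or T" as one pooled event.

```python
import math


def log_odds(probability, background):
    """Base-2 log-odds of a motif probability against a background probability."""
    return math.log2(probability / background)


# Hypothetical single motif column and uniform background.
column = {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1}
background = {base: 0.25 for base in "ACGT"}

# Semantics proposed in the thread: average the per-base log-odds for Y = C or T.
mean_of_log_odds = (
    log_odds(column["C"], background["C"])
    + log_odds(column["T"], background["T"])
) / 2

# Alternative reading: log-odds of observing "C or T" as a single event.
pooled_log_odds = log_odds(
    column["C"] + column["T"],
    background["C"] + background["T"],
)

difference = pooled_log_odds - mean_of_log_odds
```

With these numbers the two values differ, so the averaged score is not itself the log-odds of the ambiguous base. That is one reason to keep the semantics explicit and user-selectable rather than changing the default.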