[Biopython-dev] Bio.motifs.matrix.PositionSpecificScoringMatrix.calculate - scoring ambiguous sequences

Sefa Kilic sefa1 at umbc.edu
Wed Jun 22 20:35:27 UTC 2016


Thank you, Peter and Bartek, for the comments.

I agree that modifying the existing calculate function might confuse users
and could slow things down. I will go ahead and create a separate function,
and you can merge it if you think it would be useful for others.
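
For illustration, here is a minimal pure-Python sketch of what such a
separate function could look like. The name calculate_ambiguous comes from
Bartek's suggestion below; the list-of-dicts PSSM layout, the mode argument
and the hard-coded ambiguity table are assumptions made only for this
example and are not Biopython's API (Biopython keeps an equivalent table in
Bio/Data/IUPACData.py). Letters outside the table give NaN for the window,
mirroring the current behaviour described in the thread.

```python
import math

# IUPAC DNA ambiguity codes, hard-coded so the sketch is self-contained.
AMBIGUOUS_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}


def _mean(values):
    return sum(values) / len(values)


# How the per-base scores of an ambiguous letter are combined.
COMBINERS = {"mean": _mean, "min": min, "max": max}


def calculate_ambiguous(pssm, sequence, mode="mean"):
    """Score every window of `sequence` against `pssm`.

    `pssm` is a list with one dict per motif position, mapping A/C/G/T to
    a log-odds score (a hypothetical layout, chosen for this sketch).
    """
    combine = COMBINERS[mode]
    length = len(pssm)
    scores = []
    for start in range(len(sequence) - length + 1):
        window = sequence[start:start + length].upper()
        total = 0.0
        for column, letter in zip(pssm, window):
            bases = AMBIGUOUS_DNA.get(letter)
            if bases is None:
                # Unknown letter: the whole window scores as NaN.
                total = math.nan
                break
            total += combine([column[base] for base in bases])
        scores.append(total)
    return scores


pssm = [
    {"A": 1.0, "C": -1.0, "G": -2.0, "T": 0.0},
    {"A": -1.0, "C": 2.0, "G": 0.0, "T": -3.0},
]
print(calculate_ambiguous(pssm, "AYN"))  # [0.5, -1.0]
```

With mode="min" or mode="max" the same call returns the worst-case or
best-case score per window instead of the average.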

Cheers,

On Wed, Jun 22, 2016 at 5:17 AM, Peter Cock <p.j.a.cock at googlemail.com>
wrote:

> Thanks Bartek - very useful comments!
>
> I was also worried about changing the default - making this into
> a separate method would be even more explicit than the optional
> argument I was suggesting.
>
> (Apologies, I had a typo in Michiel's email address.)
>
> Peter
>
>
> On Wed, Jun 22, 2016 at 10:08 AM, Bartek Wilczynski <barwil at gmail.com>
> wrote:
>
>> Dear Peter and Sefa,
>>
>> I was involved in Bio.Motif, but not so much in Bio.motifs anymore. Since
>> you ask me personally, and I think this is an important issue, I can
>> certainly give you comments, and you can make decisions accordingly.
>>
>> This is a big change in the sense that it can potentially:
>> - give surprising results to some people (the semantics you suggest are by
>> no means standard, and the usual interpretation of this score is log-odds,
>> so taking the arithmetic mean of log-odds is questionable, though arguably
>> natural to some users)
>> - allow for some errors (using a DNA motif on a protein sequence, etc.)
>> that are currently prevented by strict checking
>> - slow things down when people are actually searching for unambiguous DNA
>> motifs
>>
>> If you think this functionality is important (I can see many places where
>> it would come in handy), I'd suggest considering writing a new function,
>> like "calculate_ambiguous" or something similar, that would be slower and
>> have well-defined semantics (potentially user-selected: I could see people
>> interested in min and max semantics in addition to the average proposed by
>> Sefa).
>>
>> That's pretty much my 2 cents
>> B
>>
>> On Wed, Jun 22, 2016 at 10:56 AM, Peter Cock <p.j.a.cock at googlemail.com>
>> wrote:
>>
>>> Hi Sefa,
>>>
>>> It looks like Michiel is extra busy at the moment, but in the absence
>>> of his input, perhaps Bartek has some thoughts (he was the original
>>> author of the motif module)?
>>>
>>> I would suggest the new mode might need to be an option, e.g.
>>> default to the current NaN results, but easy to pick your proposed
>>> mean scoring. Might the minimum or maximum ever be useful?
>>>
>>> If you (Sefa) want to go ahead and fork the repository and explore
>>> this on a branch leading to a potential pull request, that seems sensible.
>>>
>>> Note we define the IUPAC ambiguity codes centrally in Python in
>>> Bio/Data/IUPACData.py which you can import for the new Python
>>> code. I can see arguments for and against having them hard coded
>>> in the proposed new C code.
>>>
>>> Thanks,
>>>
>>> Peter
>>>
>>> On Wed, Jun 22, 2016 at 2:37 AM, Sefa Kilic <sefa1 at umbc.edu> wrote:
>>>
>>>> Any thoughts?
>>>>
>>>> On Mon, Jun 13, 2016 at 10:58 AM, Peter Cock <p.j.a.cock at googlemail.com>
>>>> wrote:
>>>>
>>>>> What do you think Michiel?
>>>>>
>>>>> Also related, earlier today I filed this issue:
>>>>> https://github.com/biopython/biopython/issues/851
>>>>>
>>>>> Peter
>>>>>
>>>>> On Mon, Jun 13, 2016 at 3:26 PM, Sefa Kilic <sefa1 at umbc.edu> wrote:
>>>>> > Hello all,
>>>>> >
>>>>> > I have been using the Bio.motifs PSSM search for a long time.
>>>>> > Occasionally, I work with genome sequences containing ambiguous bases.
>>>>> > Biopython currently does not support scoring sequences with ambiguous
>>>>> > bases and I would like to propose a change to fix that.
>>>>> >
>>>>> > Currently, the "calculate" function in PositionSpecificScoringMatrix
>>>>> > class checks if alphabets of both motif and sequence are
>>>>> > IUPAC.IUPACUnambiguousDNA. If they are not, a ValueError exception is
>>>>> > raised.
>>>>> >
>>>>> > The code itself, however, tolerates ambiguous bases on the sequence as
>>>>> > NaN. That is, given a PSSM of length L, all L-mer subsequences of the
>>>>> > given sequence are scored as NaN. I would like to extend it and do the
>>>>> > scoring properly for ambiguous sequences. For instance, if the base is
>>>>> > Y (C or T), it should be scored as the average of scoring it as C and
>>>>> > as T. If the base is N, it should be scored as the average of all
>>>>> > bases [S(A) + S(T) + S(C) + S(G)] / 4.
>>>>> >
>>>>> > The change needs to be done on both Python and C (_pwm.c) sides. What
>>>>> > do you think? If you agree, I can implement it and send a pull request.
>>>>> >
>>>>> > Cheers,
>>>>> >
>>>>> > _______________________________________________
>>>>> > Biopython-dev mailing list
>>>>> > Biopython-dev at mailman.open-bio.org
>>>>> > http://mailman.open-bio.org/mailman/listinfo/biopython-dev
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>> --
>> Bartek Wilczynski
>> ==================
>> Institute of Informatics
>> University of Warsaw
>> http://www.mimuw.edu.pl/~bartek
>>
>
>
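
A small numeric illustration of Bartek's caveat above that averaging
log-odds is not the same as taking the log-odds of an averaged probability.
The probabilities and the uniform background are made-up values for this
example only, not taken from any real motif.

```python
import math

BACKGROUND = 0.25  # uniform background frequency, assumed for illustration


def log_odds(probability):
    """Log2 odds of a motif probability against the background."""
    return math.log2(probability / BACKGROUND)


# Hypothetical motif probabilities for C and T at a single position.
p_c, p_t = 0.7, 0.1

# Proposed semantics: score Y as the mean of the C and T log-odds scores.
mean_of_log_odds = (log_odds(p_c) + log_odds(p_t)) / 2

# Alternative: average the probabilities first, then take the log-odds.
log_odds_of_mean = log_odds((p_c + p_t) / 2)

print(mean_of_log_odds, log_odds_of_mean)
```

Because the logarithm is concave, the mean of the log-odds can never exceed
the log-odds of the mean probability, so the two semantics agree only when
the candidate bases have equal probabilities.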