[Biopython-dev] Bio.motifs.matrix.PositionSpecificScoringMatrix.calculate - scoring ambiguous sequences

Bartek Wilczynski barwil at gmail.com
Wed Jun 22 09:08:13 UTC 2016


Dear Peter and Sefa,

I was involved in Bio.Motif, but not so much in Bio.motifs anymore. Since
you ask me personally, and I think this is an important issue, I can
certainly give you comments, and you can make decisions accordingly.

This is a big change in the sense that it can potentially:
- give surprising results to some people (the semantics you suggest are by
no means standard: the usual interpretation of this score is log-odds, so
taking the arithmetic mean of log-odds is questionable, though arguably
natural to some users)
- allow for some errors (using a DNA motif on a protein sequence, etc.) that
are currently prevented by strict checking
- slow things down when people are actually searching for unambiguous DNA
motifs
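
To make the first point concrete, here is a small sketch with made-up
numbers (the probabilities and the uniform background are assumptions for
illustration, not values from any real motif). It compares the arithmetic
mean of two log-odds scores with the log-odds of the averaged probability
for an ambiguous Y (C or T):

```python
import math

# Hypothetical numbers for one PSSM column: motif probabilities of C and T,
# scored against an assumed uniform background of 0.25 per base.
background = 0.25
p_c, p_t = 0.70, 0.10


def log_odds(p, q=background):
    """Log-odds score (base 2) of motif probability p against background q."""
    return math.log2(p / q)


# Proposed semantics for Y (C or T): arithmetic mean of the two log-odds.
mean_of_log_odds = (log_odds(p_c) + log_odds(p_t)) / 2

# Alternative reading: average the probabilities first, then take log-odds.
log_odds_of_mean = log_odds((p_c + p_t) / 2)

print(f"mean of log-odds: {mean_of_log_odds:.3f}")
print(f"log-odds of mean: {log_odds_of_mean:.3f}")
```

Because the logarithm is concave, the mean of the log-odds is lower than
the log-odds of the averaged probability whenever the two probabilities
differ, so the two readings of "average score" do not agree.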

If you think this functionality is important (I can see many places where
it would come in handy), I'd suggest writing a new function, such as
"calculate_ambiguous" or something similar, that would be slower and have
well-defined semantics (potentially user-selected: I could see people
interested in min and max semantics in addition to the average proposed by
Sefa).
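
A minimal pure-Python sketch of what such a separate function could look
like. Everything here is an assumption for illustration: the name
"calculate_ambiguous", the "mode" parameter, and the list-of-dicts stand-in
for the PSSM are hypothetical and are not the Bio.motifs API or the _pwm.c
implementation. The ambiguity table is hard-coded to keep the example
self-contained.

```python
# IUPAC DNA ambiguity codes, hard-coded here for self-containment (the
# thread notes Biopython defines them centrally in Bio/Data/IUPACData.py).
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def _mean(values):
    return sum(values) / len(values)


# User-selectable semantics for an ambiguous letter.
REDUCERS = {"mean": _mean, "min": min, "max": max}


def calculate_ambiguous(pssm, sequence, mode="mean"):
    """Score every window of `sequence` against `pssm`.

    `pssm` is a hypothetical stand-in: a list with one dict per motif
    position, mapping each of A, C, G, T to a log-odds score.
    """
    reduce_scores = REDUCERS[mode]
    sequence = sequence.upper()
    width = len(pssm)
    scores = []
    for start in range(len(sequence) - width + 1):
        total = 0.0
        for column, letter in zip(pssm, sequence[start:start + width]):
            # A letter outside the table (e.g. protein residue "E") raises
            # KeyError, which keeps some of the strict checking.
            candidates = [column[base] for base in IUPAC_DNA[letter]]
            total += reduce_scores(candidates)
        scores.append(total)
    return scores


# Toy two-column matrix with made-up scores.
pssm = [
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": 0.5},
    {"A": -2.0, "C": 1.5, "G": 0.0, "T": 0.5},
]
print(calculate_ambiguous(pssm, "AYN", mode="mean"))
print(calculate_ambiguous(pssm, "AYN", mode="min"))
print(calculate_ambiguous(pssm, "AYN", mode="max"))
```

Since the window score is a sum over positions, reducing per position with
min or max gives the same result as taking the min or max over all
unambiguous expansions of the window. Keeping this in a separate function
leaves the fast path for unambiguous DNA untouched.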

That's pretty much my 2 cents
B

On Wed, Jun 22, 2016 at 10:56 AM, Peter Cock <p.j.a.cock at googlemail.com>
wrote:

> Hi Sefa,
>
> It looks like Michiel is extra busy at the moment, but in the absence
> of his input, perhaps Bartek has some thoughts (he was the original
> author of the motif module)?
>
> I would suggest the new mode might need to be an option, e.g.
> default to the current NaN results, but easy to pick your proposed
> mean scoring. Might the minimum or maximum ever be useful?
>
> If you (Sefa) want to go ahead and fork the repository and explore
> this on a branch leading to a potential pull request, that seems sensible.
>
> Note we define the IUPAC ambiguity codes centrally in Python in
> Bio/Data/IUPACData.py which you can import for the new Python
> code. I can see arguments for and against having them hard coded
> in the proposed new C code.
>
> Thanks,
>
> Peter
>
> On Wed, Jun 22, 2016 at 2:37 AM, Sefa Kilic <sefa1 at umbc.edu> wrote:
>
>> Any thoughts?
>>
>> On Mon, Jun 13, 2016 at 10:58 AM, Peter Cock <p.j.a.cock at googlemail.com>
>> wrote:
>>
>>> What do you think Michiel?
>>>
>>> Also related, earlier today I filed this issue:
>>> https://github.com/biopython/biopython/issues/851
>>>
>>> Peter
>>>
>>> On Mon, Jun 13, 2016 at 3:26 PM, Sefa Kilic <sefa1 at umbc.edu> wrote:
>>> > Hello all,
>>> >
>>> > I have been using the Bio.motifs PSSM search for a long time.
>>> > Occasionally, I work with genome sequences containing ambiguous bases.
>>> > Biopython currently does not support scoring sequences with ambiguous
>>> > bases and I would like to propose a change to fix that.
>>> >
>>> > Currently, the "calculate" function in PositionSpecificScoringMatrix
>>> > class checks if alphabets of both motif and sequence are
>>> > IUPAC.IUPACUnambiguousDNA. If they are not, a ValueError exception is
>>> > raised.
>>> >
>>> > The code itself, however, tolerates ambiguous bases on the sequence as
>>> > NaN. That is, given a PSSM of length L, all L-mer subsequences of the
>>> > given sequence are scored as NaN. I would like to extend it and do the
>>> > scoring properly for ambiguous sequences. For instance, if the base is
>>> > Y (C or T), it should be scored as the average of scoring it as C and
>>> > as T. If the base is N, it should be scored as the average of all
>>> > bases [S(A) + S(T) + S(C) + S(G)] / 4.
>>> >
>>> > The change needs to be done on both Python and C (_pwm.c) sides. What
>>> > do you think? If you agree, I can implement it and send a pull request.
>>> >
>>> > Cheers,
>>> >
>>> > _______________________________________________
>>> > Biopython-dev mailing list
>>> > Biopython-dev at mailman.open-bio.org
>>> > http://mailman.open-bio.org/mailman/listinfo/biopython-dev
>>>
>>
>>
>>
>
>


-- 
Bartek Wilczynski
==================
Institute of Informatics
University of Warsaw
http://www.mimuw.edu.pl/~bartek