<div dir="ltr">Thanks Bartek - very useful comments!<div><br></div><div>I was also worried about changing the default - making this into</div><div>a separate method would be even more explicit than the optional</div><div>argument I was suggesting.<br><div><br></div><div>(Apologies, I had a typo in Michiel's email address.)</div><div><br></div><div>Peter<br><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Jun 22, 2016 at 10:08 AM, Bartek Wilczynski <span dir="ltr">&lt;<a href="mailto:barwil@gmail.com" target="_blank">barwil@gmail.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div><div><div><div><div><div><div>Dear Peter and Sefa, <br><br></div>I was involved in Bio.Motif, but not so much in Bio.motifs anymore. Since you ask me personally, and I think this is an important issue, I can certainly give you comments, and you can make decisions accordingly. <br><br>This is a big change in the sense that it can potentially:<br></div>- give surprising results to some people (the semantics you suggest are by no means standard, and the usual interpretation of this score is log-odds, so taking the arithmetic mean of log-odds is questionable, though arguably natural to some users)<br></div>- allow for some errors (using a DNA motif on a protein sequence, etc.) that are currently prevented by strict checking<br></div>- slow things down when people are actually searching for unambiguous DNA motifs<br><br></div>If you think this functionality is important (I can see many places where it would come in handy), I'd suggest considering writing a new function, 
like "calculate_ambiguous" or something similar, that would be slower and have specific semantics (potentially user-selected - I could see people interested in min and max semantics in addition to the average proposed by Sefa).<br><br></div>That's pretty much my 2 cents<br></div>B<br></div><div class="gmail_extra"><div><div class="h5"><br><div class="gmail_quote">On Wed, Jun 22, 2016 at 10:56 AM, Peter Cock <span dir="ltr">&lt;<a href="mailto:p.j.a.cock@googlemail.com" target="_blank">p.j.a.cock@googlemail.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Hi Sefa,<div><br></div><div>It looks like Michiel is extra busy at the moment, but in the absence<div>of his input, perhaps Bartek has some thoughts (he was the original</div><div>author of the motif module)?</div><div><br></div><div>I would suggest the new mode might need to be an option, e.g.</div><div>default to the current NaN results, but easy to pick your proposed</div><div>mean scoring. Might the minimum or maximum ever be useful?</div><div><br></div><div>If you (Sefa) want to go ahead and fork the repository and explore</div><div>this on a branch leading to a potential pull request, that seems sensible.</div><div><br></div><div>Note we define the IUPAC ambiguity codes centrally in Python in</div><div>Bio/Data/IUPACData.py which you can import for the new Python</div><div>code. 
I can see arguments for and against having them hard-coded</div><div>in the proposed new C code.</div><div><br></div><div>Thanks,</div><div><br></div><div>Peter<br><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Jun 22, 2016 at 2:37 AM, Sefa Kilic <span dir="ltr">&lt;<a href="mailto:sefa1@umbc.edu" target="_blank">sefa1@umbc.edu</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="ltr"><div style="font-family:arial,helvetica,sans-serif;font-size:small">Any thoughts?</div></div><div><div><div class="gmail_extra"><br><div class="gmail_quote">On Mon, Jun 13, 2016 at 10:58 AM, Peter Cock <span dir="ltr">&lt;<a href="mailto:p.j.a.cock@googlemail.com" target="_blank">p.j.a.cock@googlemail.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">What do you think, Michiel?<br>
<br>
Also related, earlier today I filed this issue:<br>
<a href="https://github.com/biopython/biopython/issues/851" rel="noreferrer" target="_blank">https://github.com/biopython/biopython/issues/851</a><br>
<br>
Peter<br>
<div><div><br>
On Mon, Jun 13, 2016 at 3:26 PM, Sefa Kilic &lt;<a href="mailto:sefa1@umbc.edu" target="_blank">sefa1@umbc.edu</a>&gt; wrote:<br>
> Hello all,<br>
><br>
> I have been using the Bio.motifs PSSM search for a long time. Occasionally,<br>
> I work with genome sequences containing ambiguous bases. Biopython currently<br>
> does not support scoring sequences with ambiguous bases and I would like to<br>
> propose a change to fix that.<br>
><br>
> Currently, the "calculate" function in the PositionSpecificScoringMatrix class<br>
> checks whether the alphabets of both the motif and the sequence are<br>
> IUPAC.IUPACUnambiguousDNA. If they are not, a ValueError exception is<br>
> raised.<br>
><br>
> The code itself, however, tolerates ambiguous bases in the sequence as NaN.<br>
> That is, given a PSSM of length L, every L-mer subsequence that contains an<br>
> ambiguous base is scored as NaN. I would like to extend it and do the scoring<br>
> properly for ambiguous sequences. For instance, if the base is Y (C or T),<br>
> it should be scored as the average of scoring it as C and as T. If the base<br>
> is N, it should be scored as the average of all bases [S(A) + S(T) + S(C) +<br>
> S(G)] / 4.<br>
><br>
> The change needs to be made on both the Python and C (_pwm.c) sides. What do you<br>
> think? If you agree, I can implement it and send a pull request.<br>
><br>
> Cheers,<br>
><br>
</div></div>> _______________________________________________<br>
> Biopython-dev mailing list<br>
> <a href="mailto:Biopython-dev@mailman.open-bio.org" target="_blank">Biopython-dev@mailman.open-bio.org</a><br>
> <a href="http://mailman.open-bio.org/mailman/listinfo/biopython-dev" rel="noreferrer" target="_blank">http://mailman.open-bio.org/mailman/listinfo/biopython-dev</a><br>
</blockquote></div><br></div>
</div></div><br></blockquote></div><br></div></div></div></div>
</blockquote></div><br><br clear="all"><br></div></div><span class="HOEnZb"><font color="#888888">-- <br><div data-smartmail="gmail_signature">Bartek Wilczynski<br>==================<br>Institute of Informatics<br>University of Warsaw<br><a href="http://www.mimuw.edu.pl/~bartek" target="_blank">http://www.mimuw.edu.pl/~bartek</a><br></div>
</font></span></div>
</blockquote></div><br></div></div></div></div>
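<div>For reference, the separate-method idea from this thread can be sketched in pure Python as below. This is only an illustration, not the Bio.motifs implementation: the function name follows Bartek's "calculate_ambiguous" suggestion, while the list-of-dicts PSSM layout, the "combine" parameter, and the hard-coded ambiguity table are assumptions made to keep the example self-contained. Real code would presumably import the codes from Bio/Data/IUPACData.py instead, and would also need the C side (_pwm.c) updated.</div>

```python
from statistics import mean

# Hypothetical stand-in for the IUPAC DNA ambiguity table; hard-coded here
# only so the sketch runs without Biopython installed.
AMBIGUOUS_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def calculate_ambiguous(pssm, sequence, combine=mean):
    """Score every window of `sequence` against `pssm`.

    `pssm` is assumed to be a list with one dict per motif position,
    each mapping A, C, G and T to a log-odds score. `combine` decides
    how the scores of the bases behind an ambiguity code are merged:
    mean (the proposal in this thread), min, or max.
    """
    length = len(pssm)
    scores = []
    for start in range(len(sequence) - length + 1):
        total = 0.0
        window = sequence[start:start + length]
        for column, letter in zip(pssm, window):
            # Expand the letter into the unambiguous bases it may stand for.
            candidates = [column[base] for base in AMBIGUOUS_DNA[letter.upper()]]
            total += combine(candidates)
        scores.append(total)
    return scores
```

<div>With the default "combine", a Y is scored as the average of its C and T scores and an N as the average over all four bases; passing min or max instead gives the alternative semantics raised by Peter and Bartek. Unambiguous letters map to themselves, so their scores are unchanged under any of the three.</div>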