[Biopython-dev] Bio.motifs.matrix.PositionSpecificScoringMatrix.calculate - scoring ambiguous sequences

Michiel de Hoon mjldehoon at yahoo.com
Thu Jun 23 01:41:42 UTC 2016


Or you could have an ambiguous=True|False keyword on the original function.
Best,
-Michiel
 

    On Thursday, June 23, 2016 5:35 AM, Sefa Kilic <sefa1 at umbc.edu> wrote:
 

 Thank you, Peter and Bartek, for the comments.
I agree that modifying the existing calculate function might be confusing for users and could slow things down. I will go ahead and create a separate function, and you can merge it if you think it would be useful for others.
Cheers,
On Wed, Jun 22, 2016 at 5:17 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:

Thanks Bartek - very useful comments!
I was also worried about changing the default - making this into a separate method would be even more explicit than the optional argument I was suggesting.

(Apologies, I had a typo in Michiel's email address.)
Peter

On Wed, Jun 22, 2016 at 10:08 AM, Bartek Wilczynski <barwil at gmail.com> wrote:

Dear Peter and Sefa, 

I was involved in Bio.Motif, but not so much in Bio.motifs anymore. Since you ask me personally, and I think this is an important issue, I can certainly give you comments, and you can make decisions accordingly. 

This is a big change in the sense that it can potentially:
- give surprising results to some people (the semantics you suggest are by no means standard; the usual interpretation of this score is log-odds, so taking the arithmetic mean of log-odds is questionable, although arguably natural to some users)
- allow for some errors (using a DNA motif on a protein sequence, etc.) that are currently prevented by strict checking
- slow things down when people are actually searching for unambiguous DNA motifs
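
To make the first point concrete, here is a small sketch of why averaging log-odds is not the same as taking the log-odds of an averaged probability. The probabilities and the uniform background below are made-up illustrative values, not taken from any real motif. The arithmetic mean of two log-odds scores equals the log-odds of the geometric mean of the two probabilities, which is never larger than the log-odds of their arithmetic mean.

```python
import math

# Illustrative assumptions: uniform background and made-up
# probabilities for C and T in a single motif column.
background = 0.25
p_c, p_t = 0.70, 0.10

log_odds_c = math.log2(p_c / background)
log_odds_t = math.log2(p_t / background)

# Proposed semantics for Y (C or T): arithmetic mean of the log-odds scores.
mean_of_log_odds = (log_odds_c + log_odds_t) / 2

# Alternative: log-odds of the arithmetic mean of the probabilities.
log_odds_of_mean = math.log2(((p_c + p_t) / 2) / background)

# The mean of log-odds is the log-odds of the geometric mean probability.
log_odds_of_geometric_mean = math.log2(math.sqrt(p_c * p_t) / background)

print(mean_of_log_odds, log_odds_of_mean, log_odds_of_geometric_mean)
```

With these numbers the two conventions differ by roughly 0.6 bits, so whichever one is chosen would need to be documented.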

If you think this functionality is important (I can see many places where it would come in handy), I'd suggest considering writing a new function, like "calculate_ambiguous" or something similar, that would be slower and have a well-defined semantics (potentially user-selected: I could see people interested in min and max semantics in addition to the average proposed by Sefa).
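
A minimal pure-Python sketch of what such a function might look like follows. This is not existing Bio.motifs code: the name calculate_ambiguous is only the suggestion above, and the mode argument, the list-of-dicts PSSM layout, and the hard-coded ambiguity table are all assumptions made for illustration.

```python
# Hypothetical sketch, not the Bio.motifs API.
AMBIGUOUS_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}


def _mean(values):
    return sum(values) / len(values)


# User-selectable semantics for combining the scores of an ambiguous base.
COMBINERS = {"mean": _mean, "min": min, "max": max}


def calculate_ambiguous(pssm, sequence, mode="mean"):
    """Score every window of the sequence against the PSSM.

    pssm is a list with one dict per motif position, mapping
    A/C/G/T to a log-odds score (an assumed layout for this sketch).
    """
    combine = COMBINERS[mode]
    sequence = sequence.upper()
    width = len(pssm)
    scores = []
    for start in range(len(sequence) - width + 1):
        total = 0.0
        for column, letter in zip(pssm, sequence[start:start + width]):
            # Letters outside the DNA table (e.g. protein) raise KeyError,
            # which keeps some of the strict checking.
            bases = AMBIGUOUS_DNA[letter]
            total += combine([column[base] for base in bases])
        scores.append(total)
    return scores


pssm = [
    {"A": 1.0, "C": -1.0, "G": -2.0, "T": 0.5},
    {"A": -1.0, "C": 2.0, "G": 0.0, "T": -3.0},
]
print(calculate_ambiguous(pssm, "AYN", mode="mean"))
print(calculate_ambiguous(pssm, "AYN", mode="min"))
print(calculate_ambiguous(pssm, "AYN", mode="max"))
```

Keeping this separate from calculate leaves the fast unambiguous path untouched, and the lookup failure on non-DNA letters catches at least the DNA-motif-on-protein mistake.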

That's pretty much my 2 cents
B

On Wed, Jun 22, 2016 at 10:56 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:

Hi Sefa,
It looks like Michiel is extra busy at the moment, but in the absence of his input, perhaps Bartek has some thoughts (he was the original author of the motif module)?
I would suggest the new mode might need to be an option, e.g. default to the current NaN results, but easy to pick your proposed mean scoring. Might the minimum or maximum ever be useful?
If you (Sefa) want to go ahead and fork the repository and explore this on a branch leading to a potential pull request, that seems sensible.
Note we define the IUPAC ambiguity codes centrally in Python in Bio/Data/IUPACData.py which you can import for the new Python code. I can see arguments for and against having them hard coded in the proposed new C code.
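
As a rough sketch of the Python side, the ambiguity codes can be read from the ambiguous_dna_values dictionary in Bio.Data.IUPACData. The fallback table and the expand helper below are assumptions added only so that the snippet also runs without Biopython installed; they are not part of Biopython.

```python
try:
    from Bio.Data import IUPACData
    ambiguous_dna_values = IUPACData.ambiguous_dna_values
except ImportError:
    # Fallback copy of the IUPAC DNA ambiguity codes (assumption for this
    # sketch, used only when Biopython is not available).
    ambiguous_dna_values = {
        "A": "A", "C": "C", "G": "G", "T": "T",
        "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
        "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "GATC",
    }


def expand(letter):
    """Return the sorted unambiguous bases an IUPAC DNA letter stands for."""
    return sorted(ambiguous_dna_values[letter.upper()])


print(expand("Y"))
print(expand("N"))
```
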
Thanks,
Peter

On Wed, Jun 22, 2016 at 2:37 AM, Sefa Kilic <sefa1 at umbc.edu> wrote:

Any thoughts?
On Mon, Jun 13, 2016 at 10:58 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:

What do you think Michiel?

Also related, earlier today I filed this issue:
https://github.com/biopython/biopython/issues/851

Peter

On Mon, Jun 13, 2016 at 3:26 PM, Sefa Kilic <sefa1 at umbc.edu> wrote:
> Hello all,
>
> I have been using the Bio.motifs PSSM search for a long time. Occasionally,
> I work with genome sequences containing ambiguous bases. Biopython currently
> does not support scoring sequences with ambiguous bases and I would like to
> propose a change to fix that.
>
> Currently, the "calculate" function in PositionSpecificScoringMatrix class
> checks if alphabets of both motif and sequence are
> IUPAC.IUPACUnambiguousDNA. If they are not, a ValueError exception is
> raised.
>
> The code itself, however, tolerates ambiguous bases in the sequence by
> scoring them as NaN. That is, given a PSSM of length L, every L-mer
> subsequence that contains an ambiguous base is scored as NaN. I would like
> to extend it and do the scoring properly for ambiguous sequences. For
> instance, if the base is Y (C or T), it should be scored as the average of
> scoring it as C and as T. If the base is N, it should be scored as the
> average over all four bases: [S(A) + S(T) + S(C) + S(G)] / 4.
>
> The change needs to be done on both Python and C (_pwm.c) sides. What do you
> think? If you agree, I can implement it and send a pull request.
>
> Cheers,
>
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at mailman.open-bio.org
> http://mailman.open-bio.org/mailman/listinfo/biopython-dev
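
The averaging rule proposed in the message above can be worked through for a single PSSM column. The scores below are made-up illustrative values, and average_score is a hypothetical helper, not a Biopython function.

```python
# Hypothetical log-odds scores for one PSSM column (illustrative values only).
column = {"A": 1.0, "C": -1.0, "G": -2.0, "T": 0.5}


def average_score(column, bases):
    """Arithmetic mean of the column scores over the given unambiguous bases."""
    return sum(column[base] for base in bases) / len(bases)


score_y = average_score(column, "CT")    # Y = C or T: (-1.0 + 0.5) / 2
score_n = average_score(column, "ACGT")  # N = any base: (1.0 - 1.0 - 2.0 + 0.5) / 4

print(score_y)  # -0.25
print(score_n)  # -0.375
```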



-- 
Bartek Wilczynski
==================
Institute of Informatics
University of Warsaw
http://www.mimuw.edu.pl/~bartek






  