[Biopython-dev] Python_MKT

Fri Sep 6 06:38:34 UTC 2013

Hi Zheng,
I think that the utilization of MultipleSeqAlignment and other modules already implemented in the Biopython framework is the next step in developing my module. The code was made independent because it says on the Biopython wiki that, whensubmitting code, it should be generalized so I didn't use any existing Biopython modules... 
As for the multi_short_path() function - it is guaranteed to find the shortest path (as far as I've tested it and I've tested it quite a bit) but I agree that it is very  confusing (even for me), but it works...  But still, my next goal is to try and rewrite it (so thank you for the suggestions :). The codon-codon matrix principle you described is also the principle behind the multi_short_path() function and, I think, it is a good way of tackling the problem... But in the end the result of the multi _short_path() is to find a tree with the least amount of overall substitutions (synonymous + non-synonymous) and with the number of non-synonymous substitutions being minimized. If you try to connect the nodes based solely on the minimum amount of synonymous substitutions you may not always get a minimum length tree (for example: if considering only the synonymous substitutions, then, theoretically, a codon_a -> codon_b exchange which requires two synonymous changes has priority over a codon_a -> codon_c which requires only one non-synonymous change, and that in turn can affect the length of the whole tree) - I hope this makes some sense to you... Also, when connecting nodes, I took the approach of first making a root of the tree and then building the tree from that root, otherwise you could end up with multiple unconnected branches... I hope this helps with your implementation... If I come up with a better alternative to the multi_short_path() I'll be sure to post a link!
Again, thanks for taking the time to going through my code, all the best,
Juraj
Date: Fri, 6 Sep 2013 00:00:06 -0400
Subject: Fwd: [Biopython-dev] Python_MKT
From: zruan1991 at gmail.com
To: biopython-dev at biopython.org; jurajbergman at hotmail.com

Hi Juraj,

I am also planing to implement MK test into my GSoC framework. I just went through you code and it is really independent. Will you be also to modify it to utilize the MultipleSeqAlignment, Alphabet and CodonTable module of Biopython so that it is more extendable?

As to the multi_short_path() function, you really confused me. Is your implementation guaranteed to find the shortest path? This problem can be abstracted as finding the minimum spanning tree in graph theory and a good algorithm is known (Prim algorithm or Kruskal algorithm). My idea of linking multiple codons is first generate a codon by codon matrix representing the synonymous and nonsynonymous substitutions each codon needs to change to the other in advance. Then finding the minimum spanning tree that connect all the node in the matrix with minimum length (least synonymous substitutions). I plan to implement this and you may have more insight about my suggestions. Thanks!

Best,Zheng Ruan

On Thu, Sep 5, 2013 at 10:33 AM, Juraj Bergman <jurajbergman at hotmail.com> wrote:

Dear all,

I'm resending my implementation of the McDonald-Kreitman test.

Link to the description of the module:https://www.dropbox.com/s/zgnz8xwlcsispzf/Python_MKT.pdf

Link to the code:https://www.dropbox.com/s/1z3opj4rbb0ms14/Python_MKT.py

I apologise for the initial mistake of sending attachments instead of links.

Kind regards,

Juraj Bergman

P.S. Regarding the multi_short_path() function - I realize that it is very, very repetitive butI have not (yet) managed to find a suitable loop construction that would replace the current code. The multi_short_path() function is by far the most complex function of the modulebecause its purpose is to find the codon network with the least amount of overall nucleotide substitutions and the least amount of non-synonymous nucleotide substitutions (given any combination of codons). Each codon is being represented as multiple lists of two integers (depending on the overall amount of codons being processed). The first integer specifies the amount of synonymous and the second specifies the amount of non-synonymous substitutions.For example, if 10 codons are being fitted in a network, then there are 10x10 = 100 combinations of codon-codon pathways, each represented with a two-integer list, and out of these 100 lists, the 'best' 10 have to be chosen to get the most optimal codon networ!

 k (and the repetitiveness of thefunction mainly arises because of this process). This is, in short, a description of the function and I would appreciate any pointers that would help to make the code more succinct :)

_______________________________________________

Biopython-dev mailing list

Biopython-dev at lists.open-bio.org

http://lists.open-bio.org/mailman/listinfo/biopython-dev