[Biopython-dev] Plan for Bio.CodonAlign development

Wed May 14 16:07:06 UTC 2014

On May 13, 2014 8:16 PM, "Zheng Ruan" <zruan1991 at gmail.com> wrote:
>
> Hi all,
>
> This summer, I would like to further enhance the CodonAlign
> module that I developed last year. Here are a couple of things
> I have in mind. Any suggestions are greatly appreciated.
>
> 1) Right now, the most awkward step in building a codon alignment
> using Bio.CodonAlign is accurately matching protein sequences
> to nucleotide sequences. If there are multiple insertion
> (frameshift) events in a nucleotide sequence, the current code will
> not work. To address this, a third-party program such as
> exonerate will help. I would like to add an option so that the
> code will accept a file, produced by exonerate, containing the
> amino acid to nucleotide correspondence, or allow the program to
> call exonerate internally to get the correspondence info.
>

Sounds good to me. Is the exonerate algorithm for this case simple enough
to implement in Biopython directly?
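
For the file-based option, here is a rough sketch of what reading the
correspondence could look like. It assumes the file holds exonerate
"vulgar" lines (a header of query/target coordinates followed by
label, query-length, target-length triples) and that the target is on
the forward strand. The names parse_vulgar and Block and the example
line are made up for illustration; none of this is existing
Bio.CodonAlign code.

```python
from collections import namedtuple

# One aligned block: a vulgar label plus half-open amino acid and
# nucleotide coordinate ranges. "Block" is a hypothetical name.
Block = namedtuple("Block", "label aa_start aa_end nt_start nt_end")


def parse_vulgar(line):
    """Turn one exonerate vulgar line into a list of coordinate blocks.

    Assumes the layout
        vulgar: q_id q_start q_end q_strand t_id t_start t_end t_strand score
    followed by (label, query_length, target_length) triples, and a
    forward-strand target so coordinates only increase.
    """
    fields = line.split()
    if not fields or fields[0] != "vulgar:":
        raise ValueError("not a vulgar line")
    if len(fields) < 10:
        raise ValueError("vulgar header is incomplete")
    query_id, target_id = fields[1], fields[5]
    aa_pos, nt_pos = int(fields[2]), int(fields[6])
    triples = fields[10:]
    if len(triples) % 3 != 0:
        raise ValueError("vulgar triples are incomplete")
    blocks = []
    for i in range(0, len(triples), 3):
        label = triples[i]
        aa_len, nt_len = int(triples[i + 1]), int(triples[i + 2])
        blocks.append(
            Block(label, aa_pos, aa_pos + aa_len, nt_pos, nt_pos + nt_len)
        )
        aa_pos += aa_len
        nt_pos += nt_len
    return query_id, target_id, blocks


# Hypothetical example: 2 residues matched to 6 nt, a 1 nt frameshift
# (label "F"), then 3 residues matched to 9 nt.
example = "vulgar: prot1 0 5 . nuc1 0 16 + 100 M 2 6 F 0 1 M 3 9"
query_id, target_id, blocks = parse_vulgar(example)
for block in blocks:
    print(block)
```

The blocks labelled "M" give the codon ranges to keep, and the "F"
blocks mark the nucleotides to skip when slicing the sequence into
codons.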

>
> 2) The Bio.CodonAlign module now contains 3 counting-based
> methods and 1 ML method for dN, dS estimation. I noticed that
> the results produced by the code are slightly different from what
> PAML gives. I will dig into this and figure out the reason. Some
> more dN, dS estimation methods will also be added.
>
> 3) The code for the chisq test in MKtest is borrowed from Eric. I
> will look into Biopython's own version of chi2
> (Bio.Phylo.PAML.chi2
> http://web.archiveorange.com/archive/v/5dAwXsd7pIljyMSmtWeb)
> and make it work for my purpose. The correction of counts in
> MKtest will also be implemented.
>

I ported that code from SciPy's C implementation; Brandon Invergo ported
the other one from PAML. Both are established algorithms, I think. It would
be worth benchmarking both and also comparing the outputs. Then the
"winner" could be moved to Bio.Statistics and used in both places.

Python 2.7 includes a log-gamma function in the math module (math.lgamma),
which we could use when it is available to speed up our pure-Python chisq
function.

>
> 4) If time permits, I want to implement the BEB (Bayes
> Empirical Bayes) approach to infer sites under positive
> selection. I'm not sure if the algorithm is suitable for a
> Python implementation since it's slow even in pure C (codeml).
> But I'll at least give it a try.
>

Cool. Maybe scipy/numpy will be fast enough to still make it usable.

>
> Thank you!
>
> Zheng Ruan
>

Thank you!