[Biopython-dev] pypaml

Brandon Invergo b.invergo at gmail.com
Fri Jun 10 10:53:12 UTC 2011


Hi everyone,

It's been quite a while since I've updated you with my PAML progress.
My side projects had to take a back seat to my PhD research for a
while, so I couldn't work on it. Anyway, I finally got back to it and
implemented some much-needed restructuring as suggested. I've
implemented a generalized PAML class which the others inherit from to
reduce duplicated code. I've also created files containing helper
functions explicitly for parsing the result files. So, the
Codeml.read() function now does all of its parsing through functions
held in the _parse_codeml.py file. This was done for both clarity and
cleanliness. I've taken the suggestion to split the parsing task into
several functions so I hope it's all a bit more readable now. I
certainly think it is; I was hesitant at first but now that it's done
I see how much better it is. I also caught some really poor parts of
the code which I was able to fix. Codeml parsing remains a bit
complicated compared to Baseml and Yn00 but I think that's just the
nature of the beast.

So, if anyone has a moment, could you take a moment to do a quick code
review? Barring any major changes, I'll send a message to the users
list to see if people would be willing to take it for a test drive.
https://github.com/brandoninvergo/biopython/tree/paml-branch/Bio/Phylo/PAML

There is one other problem. As you may recall, I decided to
reimplement the Chi2 program from the PAML package to provide a
convenient means to do likelihood ratio testing without having to load
another library (scipy, rpy2). The original was written in C but had
limited command-line options so I couldn't just write an interface to
it. Re-writing the code in Python seemed to work fine, as far as
getting the correct results/output. However, I later found that doing
tests with large degrees of freedom (one codeml model comparison
requires 41 df) takes an exorbitant amount of time compared to the C
code. So, I see three options: dig into the code to try to find ways
to optimize it, look into something like Weave for compiling the C
code into a Python module, or just remove Chi2 for now and wait for
him to release a version that takes command line arguments (which he
claims is coming in the next version). Any thoughts on this matter?

Cheers,
Brandon

On Mon, Feb 28, 2011 at 5:35 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
> Brandon;
>
> [pypaml branch: https://github.com/brandoninvergo/biopython/tree/paml-branch]
>
> [base class]
>> This is a really good idea and I'm a bit disappointed that I didn't
>> see it myself! Indeed, most of the functionality is just copied/pasted
>> between the classes, with only some variation in the
>> read/write_ctl_file functions for codeml and baseml. So, writing a
>> base class would really simplify things. I do have one question,
>> though, since this is my first time organizing my code in a
>> large-scale Python project. Where would be the best place to implement
>> this base paml class? In __init__.py or in its own paml.py file? I
>> know the end result would be the same but I figure I should start
>> learning some of these best practices.
>
> It's always easier to get perspective on code when you haven't been
> directly in the middle of it. Even if you don't have someone to do
> code reviews, stepping away from a project and coming back later
> will often lead to a bunch of insights.
>
> For the base class, I would follow Eric and Peter's example and use
> files in the same directory with an underscore: something like _shared.py
> or _base.py.
>
> [read functions]
>> This mess is precisely why I had to include so many different
>> output files for the unittesting (codeml is the main culprit; baseml
>> is moderately bad; yn00 isn't a problem)
>
> I definitely feel your pain on this. This is exactly why your work
> doing this is appreciated; you'll save someone a lot of headache
> later on.
>
>> So, because I would potentially end up scanning almost the entire file
>> just to figure out what's going on, I think just parsing-as-you-go,
>> using elif statements to short-circuit and skip further evaluations of
>> a line after a match has been found, would be the better option.
>> Perhaps the files aren't long enough to be able to make an appeal for
>> computational efficiency but at the same time, I hesitate to read
>> through the file multiple times unnecessarily. I agree, though, that
>> this makes the read() function quite long. For that, though, I tried
>> to provide descriptive comments before each parsing case, describing
>> exactly what the next block of code is meant to parse and also
>> including a specific example line which should be parsed by it.
>
> The issue really is that deeply nested code is hard to read,
> long functions are hard to read, and when you combine them together
> it just makes it very difficult for others to follow your logic.
>
> I don't think you necessarily have to make multiple passes to parse it
> in a more structure way, but what you would want to focus on is making
> the flow through the function simpler. The way I would normally attack
> this is to break components into smaller more re-usable functions.
> Here's a concrete example from the start of the codeml parser:
>
> https://github.com/brandoninvergo/biopython/blob/paml-branch/Bio/Phylo/PAML/codeml.py
>
> siteclass_re = re.match("Site-class models:\s*(.*)", line)
> if siteclass_re is not None:
>    siteclass_model = siteclass_re.group(1)
>    if siteclass_model == "":
>        multi_models = True
>        continue
>    results["site-class model"] = siteclass_model
>    if siteclass_model == "NearlyNeutral":
>        current_model = 1
>        results["NSsites"][current_model] = \
>            {"description":siteclass_model}
>        if 0 in results["NSsites"]:
>            del results["NSsites"][0]
>    elif siteclass_model == "PositiveSelection":
>        current_model = 2
>        results["NSsites"][current_model] = \
>            {"description":siteclass_model}
>        if 0 in results["NSsites"]:
>            del results["NSsites"][0]
>    elif siteclass_model == "discrete (4 categories)":
>        current_model = 3
>        results["NSsites"][current_model] = \
>            {"description":siteclass_model}
>        if 0 in results["NSsites"]:
>            del results["NSsites"][0]
>    elif siteclass_model == "beta (4 categories)":
>        current_model = 7
>        results["NSsites"][current_model] = \
>            {"description":siteclass_model}
>        if 0 in results["NSsites"]:
>            del results["NSsites"][0]
>    elif siteclass_model == "beta&w>1 (5 categories)":
>        current_model = 8
>        results["NSsites"][current_model] = \
>            {"description":siteclass_model}
>        if 0 in results["NSsites"]:
>            del results["NSsites"][0]
>
> You could refactor this something along the lines of:
>
> class _CodemlParser:
>    def __init__(self):
>        self.results = {}
>        self.flags = dict(multi_models = False)
>
>    def read(self, results_handle):
>        for line in results_handle:
>            siteclass_re = re.match("Site-class models:\s*(.*)", line)
>            if siteclass_re is not None:
>                self._siteclass_parse(siteclass_re)
>
>    def _add_siteclass_model(self, siteclass_model):
>        self.results["site-class model"] = siteclass_model
>        name_to_num = {"NearlyNeutral": 1,
>                       "PositiveSelection": 2,
>                       "discrete (4 categories)": 3,
>                       "beta (4 categories)": 7
>                       "beta&w>1 (5 categories)": 8}
>        current_model = name_to_num[siteclass_model]
>        self.results["NSsites"][current_model] = {"description":siteclass_model}
>        if 0 in results["NSsites"]:
>            del results["NSsites"][0]
>
>    def _siteclass_parse(self, siteclass_re):
>        if siteclass_model == "":
>            self.flags["multi_models"] = True
>        else:
>            self._add_siteclass_model(siteclass_model)
>
> You are not changing the parsing strategy, but now you've got
> individual functions handling each of the steps so it's clear that
> the _siteclass_parse either sets multi_models or adds details about
> the single model. Then you can dig into the _add_siteclass_model
> function to see what it is doing. To the reader, each individual
> unit can be read and understood separately.
>
> This type of refactoring work is useful generally. I have to do it all
> the time in my work and discover new tricks and approaches. Hope this
> is helpful and thanks again for all the work on this,
> Brad
>




More information about the Biopython-dev mailing list