[Biopython-dev] pypaml

Brad Chapman chapmanb at 50mail.com
Mon Feb 28 16:35:21 UTC 2011


Brandon;

[pypaml branch: https://github.com/brandoninvergo/biopython/tree/paml-branch]

[base class]
> This is a really good idea and I'm a bit disappointed that I didn't
> see it myself! Indeed, most of the functionality is just copied/pasted
> between the classes, with only some variation in the
> read/write_ctl_file functions for codeml and baseml. So, writing a
> base class would really simplify things. I do have one question,
> though, since this is my first time organizing my code in a
> large-scale Python project. Where would be the best place to implement
> this base paml class? In __init__.py or in its own paml.py file? I
> know the end result would be the same but I figure I should start
> learning some of these best practices.

It's always easier to get perspective on code when you haven't been
directly in the middle of it. Even if you don't have someone to do 
code reviews, stepping away from a project and coming back later
will often lead to a bunch of insights.

For the base class, I would follow Eric and Peter's example and use
files in the same directory with an underscore: something like _shared.py 
or _base.py.
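
To make that concrete, here's a rough sketch of the kind of layout I
have in mind. The file names, class names and the particular options
below are just placeholders, not a prescription for the final code:

# _paml.py -- hypothetical shared module; all names are placeholders
class Paml(object):
    """Behaviour shared by the codeml, baseml and yn00 wrappers."""

    def __init__(self, alignment=None, out_file=None):
        self.alignment = alignment
        self.out_file = out_file
        self._options = {}   # each subclass supplies its own option set

    def set_options(self, **kwargs):
        # Reject any option the subclass doesn't know about
        for option, value in kwargs.items():
            if option not in self._options:
                raise KeyError("Invalid option: %s" % option)
            self._options[option] = value


# codeml.py -- the subclass only carries what actually differs
class Codeml(Paml):
    def __init__(self, alignment=None, out_file=None):
        Paml.__init__(self, alignment, out_file)
        self._options = {"noisy": None, "verbose": None, "NSsites": None}

    def write_ctl_file(self, ctl_file):
        # codeml-specific control file layout goes here
        with open(ctl_file, "w") as handle:
            for option, value in self._options.items():
                if value is not None:
                    handle.write("%s = %s\n" % (option, value))

That way the option bookkeeping lives in one place, and codeml/baseml
only override the read/write_ctl_file behaviour that actually varies.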

[read functions]
> This mess is precisely why I had to include so many different
> output files for the unittesting (codeml is the main culprit; baseml
> is moderately bad; yn00 isn't a problem)

I definitely feel your pain on this. It's exactly why the work you're
doing is appreciated; you'll save someone a lot of headaches
later on.

> So, because I would potentially end up scanning almost the entire file
> just to figure out what's going on, I think just parsing-as-you-go,
> using elif statements to short-circuit and skip further evaluations of
> a line after a match has been found, would be the better option.
> Perhaps the files aren't long enough to be able to make an appeal for
> computational efficiency but at the same time, I hesitate to read
> through the file multiple times unnecessarily. I agree, though, that
> this makes the read() function quite long. For that, though, I tried
> to provide descriptive comments before each parsing case, describing
> exactly what the next block of code is meant to parse and also
> including a specific example line which should be parsed by it.

The issue is really that deeply nested code is hard to read, long
functions are hard to read, and combining the two makes it very
difficult for others to follow your logic.

I don't think you necessarily have to make multiple passes to parse it
in a more structured way; what you want to focus on is making the flow
through the function simpler. The way I would normally attack this is
to break components into smaller, more reusable functions.
Here's a concrete example from the start of the codeml parser:

https://github.com/brandoninvergo/biopython/blob/paml-branch/Bio/Phylo/PAML/codeml.py

siteclass_re = re.match("Site-class models:\s*(.*)", line)
if siteclass_re is not None:
    siteclass_model = siteclass_re.group(1)
    if siteclass_model == "":
        multi_models = True
        continue
    results["site-class model"] = siteclass_model
    if siteclass_model == "NearlyNeutral":
        current_model = 1
        results["NSsites"][current_model] = \
            {"description":siteclass_model}
        if 0 in results["NSsites"]:
            del results["NSsites"][0]
    elif siteclass_model == "PositiveSelection":
        current_model = 2
        results["NSsites"][current_model] = \
            {"description":siteclass_model}
        if 0 in results["NSsites"]:
            del results["NSsites"][0]
    elif siteclass_model == "discrete (4 categories)":
        current_model = 3
        results["NSsites"][current_model] = \
            {"description":siteclass_model}
        if 0 in results["NSsites"]:
            del results["NSsites"][0]
    elif siteclass_model == "beta (4 categories)":
        current_model = 7
        results["NSsites"][current_model] = \
            {"description":siteclass_model}
        if 0 in results["NSsites"]:
            del results["NSsites"][0]
    elif siteclass_model == "beta&w>1 (5 categories)":
        current_model = 8
        results["NSsites"][current_model] = \
            {"description":siteclass_model}
        if 0 in results["NSsites"]:
            del results["NSsites"][0]

You could refactor this along the lines of:

import re

class _CodemlParser:
    def __init__(self):
        # "NSsites" collects the per-model results
        self.results = {"NSsites": {}}
        self.flags = dict(multi_models=False)

    def read(self, results_handle):
        for line in results_handle:
            siteclass_re = re.match("Site-class models:\s*(.*)", line)
            if siteclass_re is not None:
                self._siteclass_parse(siteclass_re)

    def _add_siteclass_model(self, siteclass_model):
        self.results["site-class model"] = siteclass_model
        name_to_num = {"NearlyNeutral": 1,
                       "PositiveSelection": 2,
                       "discrete (4 categories)": 3,
                       "beta (4 categories)": 7,
                       "beta&w>1 (5 categories)": 8}
        current_model = name_to_num[siteclass_model]
        self.results["NSsites"][current_model] = \
            {"description": siteclass_model}
        if 0 in self.results["NSsites"]:
            del self.results["NSsites"][0]

    def _siteclass_parse(self, siteclass_re):
        # Either flag that several models follow or record the single model
        siteclass_model = siteclass_re.group(1)
        if siteclass_model == "":
            self.flags["multi_models"] = True
        else:
            self._add_siteclass_model(siteclass_model)

You're not changing the parsing strategy, but you now have individual
functions handling each step, so it's clear that _siteclass_parse
either sets multi_models or adds details about the single model. From
there you can dig into _add_siteclass_model to see what it does. To
the reader, each individual unit can be read and understood
separately.
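
And, just to close the loop, the public read() in codeml.py could then
stay a thin wrapper around that helper. Again, only a sketch, reusing
the hypothetical _CodemlParser from above:

def read(results_file):
    """Parse a codeml results file and return the results dict (sketch)."""
    parser = _CodemlParser()
    with open(results_file) as results_handle:
        parser.read(results_handle)
    return parser.results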

This type of refactoring is useful generally; I end up doing it all
the time in my own work and still discover new tricks and approaches.
Hope this is helpful, and thanks again for all the work on this,
Brad


