[Biopython-dev] pypaml

Sat Jan 15 18:23:19 UTC 2011

On Sat, Jan 15, 2011 at 7:20 AM, Brandon Invergo <b.invergo at gmail.com>wrote:

> I'll reply to both Eric and Peter in this one...
>
> >>> The functionality here looks great. My stylistic suggestion would be
> >>> to separate the code for running the commandline from that used to
> >>> parse the output file. Ideally these would be two separate classes
> >>> that could live under the Bio.Phylo namespace:
> >>>
> >>> https://github.com/biopython/biopython/tree/master/Bio/Phylo
> >>>
> >>
> >> I agree.
> >
> > That sounds good. This will be a big change for anyone already
> > using the stand alone pypaml - but some changes are unavoidable.
>
> I plan to make a tag of the current version on Google Code and then
> branch it and start making these structural changes. I'll put a notice
> on the main page to let the users know how things will be changing as
> I prepare to migrate to Biopython. It'll be a slow, steady process.
>

Sounds great to me! Slow and steady wins the race.

 >> For the commandline code, it would be nice to have a
> >>> Bio.Phylo.Applications that is organized similar to
> >>> Bio.Align.Applications:
> >>>
> >>>
> https://github.com/biopython/biopython/tree/master/Bio/Align/Applications
> >>>
> >>> This will give you some flexibility as you want to expand out to
> >>> support other programs, and provide a framework for additional
> >>> phylogenetic commandline utilities.
> >>>
> >>
> >> Since it sounds like you might eventually write wrappers for other
> programs
> >> in the PAML suite, a layout like this might work:
> >>
> >> Bio/Phylo/Applications/_codeml.py
> >>  -- just the wrapper for running the command-line program, perhaps based
> on
> >> the Bio.Application classes. The API for calling the wrapper goes
> through
> >> __init__.py; the user doesn't import this module directly. (See
> >> Bio.Align.Applications)
> >>
> >
> > Roughly how many applications are there in PAML? What Brad and
> > Eric have outlined would work fine, but we could opt for something
> > a little different, like the namespace Bio.Phylo.Applications for
> > general tools (there are some tree building tools I could write
> > wrappers for - using the same setup as Bio.Align.Applications),
> > and have namespace Bio.Phylo.Applications.PAML for the PAML
> > wrappers. Another reason to separate them is they won't be
> > using the simple Bio.Application framework (due to the way
> > PAML options must be specified via input files).
>

If the Bio.Applications framework won't work for codeml/PAML then I think it
would be misleading to put any pypaml code under Bio.Phylo.Applications (at
least for now). Later we might find a way to put PAML options into named
temporary files and run the command-line applications that way, but that's
probably not a priority yet.

*Code Philosophy*:
It's my understanding that tightly nested namespaces are nicer for library
developers, but flatter namespaces are nicer for the users of those
libraries (especially those who don't use full-featured IDEs like Eclipse).
Python lets us have it both ways, to some extent, by importing protected
modules to a higher-level namespace. See if you agree with examples like
these:

# Common functionality and generalized tree I/O is available at the top
level
>>> from Bio import Phylo

# Everything under *.Applications directly uses the Bio.Application
framework
>>> from Bio.Phylo.Applications import RAxMLCommandline
>>> from Bio.Phylo.Applications import MrBayesCommandline

# Extra functionality for a popular application suite goes in a separate
sub-package
>>> from Bio.Phylo.PAML import codeml

# A namespace for web services reminds the user that the network will be
used
>>> from Bio.Phylo.WWW import Dryad

This is basically what we proposed with Bio.Struct for GSoC 2010, and I
don't think any of it contradicts the existing conventions of Bio.Align.

Namespace collisions are unlikely: the sub-packages would generally be
either support for new file formats or helpers for application suites, and
those would only match if an application suite defined its own file formats
-- in which case the modules do belong under the same sub-package.

 >> Yes. Also, the user might have saved the output from a codeml run
> >> previously (maybe from a shell script/pipeline), and want to parse it
> >> without re-running codeml through a Python wrapper. Right? (Sorry
> >> if I misunderstood your code.)
>
> Actually, it currently does support doing this. The parse_results()
> function takes a string filename as an argument so you can call it
> without having run any analyses yet. Still, it makes more sense to
> make the parser a separate class. What I'm torn about is to either
> have a single PAML parser class or to have separate parsers for each
> program. The output files contain the program name in the first line
> so it's simple enough to determine what kind of output you're looking
> at, but the code might get a bit long and cumbersome.
>

I'd recommend splitting the parsers into separate modules. Small functions
and classes are much easier to maintain.

If everyone agrees with this layout, I'd suggest putting your existing
__init__.py and codeml.py under Bio/Phylo/PAML/.
Inside codeml.py, I'd suggest:

1. Have the run() method raise an exception when the subprocess return code
is non-zero, instead of returning the subprocess return code directly (try
subprocess.check_call in place of subprocess.call, or see
Bio/Application/__init__.py). Most of the time the user will want to throw
an error of some sort if the command line fails; this is more direct.

Then, since run() no longer needs to return an integer, it's free to return
the results dictionary instead.

2. Change parse_results() to return a dictionary, rather than setting it on
self.results. So the run() function retrieves this dictionary by calling
parse_results(), then returns it (after chdir'ing).

3. Now that parse_results() doesn't need direct access to self._results,
move it out of the codeml class and rename it as a standalone function:

def read(results_file, version=None):
    ...

Any other optional info that parse_results()/read() needs can also be passed
as keyword arguments -- I'm not sure if I missed any places where that's
occurring.

This is the same overall change Brad was suggesting, I think. It also brings
the style of pypaml/codeml pretty much in line with how Biopython and
Bio.Application work, so further integration would be easier in the future.

Best,
Eric