[Biopython-dev] pypaml

Eric Talevich eric.talevich at gmail.com
Sat Jan 15 18:23:19 UTC 2011

On Sat, Jan 15, 2011 at 7:20 AM, Brandon Invergo <b.invergo at gmail.com>wrote:

> I'll reply to both Eric and Peter in this one...
> >>> The functionality here looks great. My stylistic suggestion would be
> >>> to separate the code for running the commandline from that used to
> >>> parse the output file. Ideally these would be two separate classes
> >>> that could live under the Bio.Phylo namespace:
> >>>
> >>> https://github.com/biopython/biopython/tree/master/Bio/Phylo
> >>>
> >>
> >> I agree.
> >
> > That sounds good. This will be a big change for anyone already
> > using the stand alone pypaml - but some changes are unavoidable.
> I plan to make a tag of the current version on Google Code and then
> branch it and start making these structural changes. I'll put a notice
> on the main page to let the users know how things will be changing as
> I prepare to migrate to Biopython. It'll be a slow, steady process.

Sounds great to me! Slow and steady wins the race.

 >> For the commandline code, it would be nice to have a
> >>> Bio.Phylo.Applications that is organized similar to
> >>> Bio.Align.Applications:
> >>>
> >>>
> https://github.com/biopython/biopython/tree/master/Bio/Align/Applications
> >>>
> >>> This will give you some flexibility as you want to expand out to
> >>> support other programs, and provide a framework for additional
> >>> phylogenetic commandline utilities.
> >>>
> >>
> >> Since it sounds like you might eventually write wrappers for other
> programs
> >> in the PAML suite, a layout like this might work:
> >>
> >> Bio/Phylo/Applications/_codeml.py
> >>  -- just the wrapper for running the command-line program, perhaps based
> on
> >> the Bio.Application classes. The API for calling the wrapper goes
> through
> >> __init__.py; the user doesn't import this module directly. (See
> >> Bio.Align.Applications)
> >>
> >
> > Roughly how many applications are there in PAML? What Brad and
> > Eric have outlined would work fine, but we could opt for something
> > a little different, like the namespace Bio.Phylo.Applications for
> > general tools (there are some tree building tools I could write
> > wrappers for - using the same setup as Bio.Align.Applications),
> > and have namespace Bio.Phylo.Applications.PAML for the PAML
> > wrappers. Another reason to separate them is they won't be
> > using the simple Bio.Application framework (due to the way
> > PAML options must be specified via input files).

If the Bio.Applications framework won't work for codeml/PAML then I think it
would be misleading to put any pypaml code under Bio.Phylo.Applications (at
least for now). Later we might find a way to put PAML options into named
temporary files and run the command-line applications that way, but that's
probably not a priority yet.

*Code Philosophy*:
It's my understanding that tightly nested namespaces are nicer for library
developers, but flatter namespaces are nicer for the users of those
libraries (especially those who don't use full-featured IDEs like Eclipse).
Python lets us have it both ways, to some extent, by importing protected
modules to a higher-level namespace. See if you agree with examples like

# Common functionality and generalized tree I/O is available at the top
>>> from Bio import Phylo

# Everything under *.Applications directly uses the Bio.Application
>>> from Bio.Phylo.Applications import RAxMLCommandline
>>> from Bio.Phylo.Applications import MrBayesCommandline

# Extra functionality for a popular application suite goes in a separate
>>> from Bio.Phylo.PAML import codeml

# A namespace for web services reminds the user that the network will be
>>> from Bio.Phylo.WWW import Dryad

This is basically what we proposed with Bio.Struct for GSoC 2010, and I
don't think any of it contradicts the existing conventions of Bio.Align.

Namespace collisions are unlikely: the sub-packages would generally be
either support for new file formats or helpers for application suites, and
those would only match if an application suite defined its own file formats
-- in which case the modules do belong under the same sub-package.

 >> Yes. Also, the user might have saved the output from a codeml run
> >> previously (maybe from a shell script/pipeline), and want to parse it
> >> without re-running codeml through a Python wrapper. Right? (Sorry
> >> if I misunderstood your code.)
> Actually, it currently does support doing this. The parse_results()
> function takes a string filename as an argument so you can call it
> without having run any analyses yet. Still, it makes more sense to
> make the parser a separate class. What I'm torn about is to either
> have a single PAML parser class or to have separate parsers for each
> program. The output files contain the program name in the first line
> so it's simple enough to determine what kind of output you're looking
> at, but the code might get a bit long and cumbersome.

I'd recommend splitting the parsers into separate modules. Small functions
and classes are much easier to maintain.

If everyone agrees with this layout, I'd suggest putting your existing
__init__.py and codeml.py under Bio/Phylo/PAML/.
Inside codeml.py, I'd suggest:

1. Have the run() method raise an exception when the subprocess return code
is non-zero, instead of returning the subprocess return code directly (try
subprocess.check_call in place of subprocess.call, or see
Bio/Application/__init__.py). Most of the time the user will want to throw
an error of some sort if the command line fails; this is more direct.

Then, since run() no longer needs to return an integer, it's free to return
the results dictionary instead.

2. Change parse_results() to return a dictionary, rather than setting it on
self.results. So the run() function retrieves this dictionary by calling
parse_results(), then returns it (after chdir'ing).

3. Now that parse_results() doesn't need direct access to self._results,
move it out of the codeml class and rename it as a standalone function:

def read(results_file, version=None):

Any other optional info that parse_results()/read() needs can also be passed
as keyword arguments -- I'm not sure if I missed any places where that's

This is the same overall change Brad was suggesting, I think. It also brings
the style of pypaml/codeml pretty much in line with how Biopython and
Bio.Application work, so further integration would be easier in the future.


More information about the Biopython-dev mailing list