b.invergo at gmail.com
Sat Jan 15 12:20:05 UTC 2011
I'll reply to both Eric and Peter in this one...
>>> The functionality here looks great. My stylistic suggestion would be
>>> to separate the code for running the commandline from that used to
>>> parse the output file. Ideally these would be two separate classes
>>> that could live under the Bio.Phylo namespace:
>> I agree.
> That sounds good. This will be a big change for anyone already
> using the stand alone pypaml - but some changes are unavoidable.
I plan to make a tag of the current version on Google Code and then
branch it and start making these structural changes. I'll put a notice
on the main page to let the users know how things will be changing as
I prepare to migrate to Biopython. It'll be a slow, steady process.
>>> For the commandline code, it would be nice to have a
>>> Bio.Phylo.Applications that is organized similar to
>>> This will give you some flexibility as you want to expand out to
>>> support other programs, and provide a framework for additional
>>> phylogenetic commandline utilities.
>> Since it sounds like you might eventually write wrappers for other programs
>> in the PAML suite, a layout like this might work:
>> -- just the wrapper for running the command-line program, perhaps based on
>> the Bio.Application classes. The API for calling the wrapper goes through
>> __init__.py; the user doesn't import this module directly. (See
> Roughly how many applications are there in PAML? What Brad and
> Eric have outlined would work fine, but we could opt for something
> a little different, like the namespace Bio.Phylo.Applications for
> general tools (there are some tree building tools I could write
> wrappers for - using the same setup as Bio.Align.Applications),
> and have namespace Bio.Phylo.Applications.PAML for the PAML
> wrappers. Another reason to separate them is they won't be
> using the simple Bio.Application framework (due to the way
> PAML options must be specified via input files).
There are 8 programs in PAML. Here is what they cover, copied from the manual:
• Comparison and tests of phylogenetic trees (baseml and codeml);
• Estimation of parameters in sophisticated substitution models,
including models of variable rates among sites and models for combined
analysis of multiple genes or site partitions (baseml and codeml);
• Likelihood ratio tests of hypotheses through comparison of
implemented models (baseml, codeml, chi2);
• Estimation of divergence times under global and local clock models
(baseml and codeml);
• Likelihood (Empirical Bayes) reconstruction of ancestral sequences
using nucleotide, amino acid and codon models (baseml and codeml);
• Generation of datasets of nucleotide, codon, and amino acid sequence
by Monte Carlo simulation (evolver);
• Estimation of synonymous and nonsynonymous substitution rates and
detection of positive selection in protein-coding DNA sequences (yn00
and codeml);
• Bayesian estimation of species divergence times incorporating
uncertainties in fossil calibrations (mcmctree).
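
On the point that PAML options must be specified via input files: here is
a minimal sketch of what a wrapper would have to do, as opposed to building
a list of command-line flags. It is not the current pypaml code. The
function names (write_control_file, build_command, run_program) are
hypothetical, and it assumes the control file is plain "key = value" lines
and that the program accepts the control file path as its only argument.

```python
import subprocess


def write_control_file(path, options):
    """Write options as 'key = value' lines (assumed control file layout)."""
    with open(path, "w") as handle:
        for key, value in options.items():
            handle.write(f"{key} = {value}\n")


def build_command(program, ctl_path):
    """Return the argument list: program name plus control file path.

    Assumption: all settings live in the control file, so no other
    arguments are passed on the command line.
    """
    return [program, ctl_path]


def run_program(program, ctl_path, working_dir="."):
    """Run the program on a control file; raises if it exits non-zero."""
    return subprocess.run(
        build_command(program, ctl_path),
        cwd=working_dir,
        check=True,
    )
```

A caller would write the options first and then run, e.g.
write_control_file("codeml.ctl", {"seqfile": "aln.phy", ...}) followed by
run_program("codeml", "codeml.ctl"). The file names here are placeholders.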
>> Yes. Also, the user might have saved the output from a codeml run
>> previously (maybe from a shell script/pipeline), and want to parse it
>> without re-running codeml through a Python wrapper. Right? (Sorry
>> if I misunderstood your code.)
Actually, it currently does support doing this. The parse_results()
function takes a string filename as an argument so you can call it
without having run any analyses yet. Still, it makes more sense to
make the parser a separate class. What I'm torn on is whether to have a
single PAML parser class or separate parsers for each program. The
output files contain the program name in the first line, so it's simple
enough to determine what kind of output you're looking at, but a single
class handling every program might get long and cumbersome.
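
One way to get both would be a single entry point that reads the first
line and hands off to a per-program parser. The following is only a rough
sketch of that idea. The header tokens it matches are placeholders (they
would need checking against real output files), and the per-program
parsers are stubs with hypothetical names.

```python
def detect_program(lines, known=("codeml", "baseml", "yn00", "mcmctree")):
    """Return the known name found in the first non-blank line, else None.

    Assumption: the header line contains one of the tokens in `known`,
    compared case-insensitively.
    """
    for line in lines:
        if line.strip():
            lowered = line.lower()
            for name in known:
                if name in lowered:
                    return name
            return None  # first non-blank line had no known token
    return None  # file was empty or all blank


def parse_codeml(lines):
    """Stub: a real parser would extract the codeml results here."""
    return {"program": "codeml", "n_lines": len(lines)}


def parse_baseml(lines):
    """Stub: a real parser would extract the baseml results here."""
    return {"program": "baseml", "n_lines": len(lines)}


# Dispatch table: program name -> parser function.
PARSERS = {"codeml": parse_codeml, "baseml": parse_baseml}


def parse_results(filename):
    """Read a saved output file and dispatch on its first line."""
    with open(filename) as handle:
        lines = handle.read().splitlines()
    program = detect_program(lines)
    if program not in PARSERS:
        raise ValueError(f"no parser available for output type: {program!r}")
    return PARSERS[program](lines)
```

Each parser stays short and separate, while users still call one function
on any saved output file without having run anything through the wrapper.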
Thanks for the input, everyone. I hope to get a lot of this done this
weekend, though it's a busy one with other projects going on at the same
time.