[Biopython-dev] pypaml

Fri Feb 25 16:57:19 UTC 2011

Hi Brad,
Thanks for your response! It's taken me a day or two to think about
what you wrote (also balancing a PhD with the hobby projects at the
moment...)

> It's really great to see the extensive tests. I'm also impressed
> with your story of porting over 'goto' statements; it's been a while
> since those have entered my mind:

To be honest, I forgot they existed. Seeing them immediately made the
computer scientist in me cringe. They really confused the whole
structure of the program but in the end they were solved quite easily
with some carefully placed loops and conditional blocks!

> - These looks to be a lot of shared functionality between codeml,
>  baseml and yn00 in setting up the control files. Would it be
>  possible to create a base class that these all inherit from? This
>  would make the code much easier to maintain over time as formats
>  change.

This is a really good idea and I'm a bit disappointed that I didn't
see it myself! Indeed, most of the functionality is just copied/pasted
between the classes, with only some variation in the
read/write_ctl_file functions for codeml and baseml. So, writing a
base class would really simplify things. I do have one question,
though, since this is my first time organizing my code in a
large-scale Python project. Where would be the best place to implement
this base paml class? In __init__.py or in its own paml.py file? I
know the end result would be the same but I figure I should start
learning some of these best practices.

> - Your 'read' functions get pretty deeply nested, especially the
>  codeml parser. What do you think about creating an internal class
>  to split some of the parsing logic into individual functions? A
>  nice example is the GenBank/Scanner.py code. Having functions like
>  parse_header/parse_features makes it much easier for someone not
>  deeply familiar with your code to start to make guesses at where
>  different functionality exists. This way, if the format changes
>  others can provide patches and feedback to you.

I'm not so sure about this mainly because of the way the output files
are formatted. For example, the most common usage of codeml (the most
common program of the bunch) is to run with several several "NSsites"
models. If you do this, the output file is separated into segments
which are headed by a line that says something like "Model 2:
PositiveSelection", and the model parameters are printed out below.
However, if you only run with one model, which is also a common usage,
you no longer have these convenient headers and instead at the very
top of the output file is a completely different indication of which
model was used, but which is inconveniently missing if only model 0
was run. In other cases, such as amino acid sequence analysis,
pairwise nucleotide sequence or multiple gene analyses, there's no
header whatsoever indicating which kind of output file you're looking
at. Instead, you just have to search for particular data patterns to
parse. This mess is precisely why I had to include so many different
output files for the unittesting (codeml is the main culprit; baseml
is moderately bad; yn00 isn't a problem)

So, because I would potentially end up scanning almost the entire file
just to figure out what's going on, I think just parsing-as-you-go,
using elif statements to short-circuit and skip further evaluations of
a line after a match has been found, would be the better option.
Perhaps the files aren't long enough to be able to make an appeal for
computational efficiency but at the same time, I hesitate to read
through the file multiple times unnecessarily. I agree, though, that
this makes the read() function quite long. For that, though, I tried
to provide descriptive comments before each parsing case, describing
exactly what the next block of code is meant to parse and also
including a specific example line which should be parsed by it.

That said, I will take another look at the output files to see if
there could be another way of implementing it. Without a doubt, the
parsing is the most difficult part of implementing this module; the
rest of it is quite trivial. So, best to do it right!

> Overall this is great and all the work is much appreciated.

Thanks! It's been a fun side project for me.

Cheers,
Brandon

ps - I still haven't sent a message to the main Biopython list while I
consider implementing at least the first suggestion above, since it
would involve large changes that might cause me to accidentally break
something! I'll wait until I'm a bit more confident that it's close to
the final product