[Biopython-dev] Features of the GSOC branch ready to be merged

Sat Jan 22 23:06:52 UTC 2011

On Fri, Jan 21, 2011 at 10:13 AM, João Rodrigues <anaryin at gmail.com> wrote:
> Hello all,
>
> I've been working on the renumbering residues, remove disordered atoms, and
> biological unit representation functions.
>
> I've made quite some changes, specially to the renumbering algorithm.
> Explanation follows:
>
> Before I simply calculated how much to subtract from each residue number
> based on the first. That worked perfectly if all residue numbers were in a
> growing progression, which was not the case for some structures. Also,
> HETATMs weren't separated from the main ATOM lines, and in many PDB files
> you see numbering starting from 1000 for example.
>
> What I coded allows for certain discrimination of HETATMs from ATOMs based
> on the SEQRES field of the PDB file header (added parsing to
> parse_pdb_header). This ensures HETATMs are numbered from 1000. I've also
> incorporated a way of filtering modified aminoacids (that show up as HETATM
> but in between ATOM lines) to be treated as ATOMs if there is no SEQRES
> header present in the PDB file by looking for a CA atom. A warning is issued
> along with this "magic" feature turning on annoucing that the results may be
> a bit unreliable..
>
> I've shown the code and the idea to the people in my lab and I got generally
> good responses, but of course they are all biased :) Have a look for
> yourselves, I created a branch for these.
>
> https://github.com/JoaoRodrigues/biopython/tree/pdb_enhancements

Hi João,

Good stuff. I see you made a nice clean revision history for us, too -- thanks!

Whitespace:
Some extra spaces crept in and are throwing off the diff in
Structure.py. Also, Structure.remove_disordered_atoms has a bunch of
blank lines in the function body; could you slim it down?

Chain.renumber_residues:
1. How about 'res_init' and 'het_init'? When I see "seed" I think RNG.
2. It looks like het_init does numbering automagically if the given
value is 0, and otherwise numbers HETATMs in the expected way starting
from the given number. How about letting "het_init=None" be the
default, so "if not het_init" triggers the magical behavior, and if a
positive integer is given then go from there. (Check that het_init >=
1 and maybe het_init == int(het_init) to allow 1.0 to be coerced to 1
if you want.)
3. I see in the last commit you added a local variable called 'magic'.
<suspicious_look /> Could you find a better name for that? I think
'guess_by_ca' would fit, if I'm reading the code correctly.
4. In the last block (lines 170-174 now), could you add a comment
explaining why it would be reached? Before this commit there was a
comment "Other HETATMs" but I'm not sure I fully understand. Is it for
HETATMs not associated with any residue, i.e. not residue
modifications?

Structure.renumber_residues:
1. OK, I see what you're doing with het_seed=0 -- clever, but maybe
more clever than necessary. It's not obvious from reading just this
code that the first iteration is a special case for HETATM numbering;
a maintainer would have to look at Chain.py too. A comment about that
would help, I think.
2. Why (h/1000 >= 1) instead of (h >= 1000) ?
3. If the Chain.renumber_residues arguments change to 'res_init' and
'het_init', then 'seed' here should change to 'init'
4. The arguments 'sequential' and 'chain_displace' seem to interact --
I don't think I'd use chain_displace != 1 unless I had set
sequential=True. So, it seems like chain_displace should only take
effect if sequential=True (i.e. line 77 would be indented another
level). To tighten things up further, I'd combine those two arguments
into a single 'skip/gap_between_chains' or similar:

# UNTESTED
for chain in model:
    r_new, h_new = chain.renumber_residues(res_seed=r, het_seed=h)
    if skip_between_chains:
        r = r_new + skip_between_chains
    if h_new >= 1000:
        # Each chain's HETATM numbering starts at the next even 1000*N
        h = 1000*((h_new/1000)+1)
    else:
        h = h_new + 1

Structure.build_biological_unit:
It looks like if the structure has more than one model, the models
after 0 will be clobbered when this method is run. So, unless a better
solution appears, it's safest to add an assert for len(self.models) ==
1 or check+ValueError for multiple models.

All the best,
Eric