[Biopython-dev] Features of the GSOC branch ready to be merged

Wed Jan 26 00:41:42 UTC 2011

Trying to reply point per point both to Eric and Kristian.

Hi João,
>
> Good stuff. I see you made a nice clean revision history for us, too --
> thanks!
>

I'm still trying to get the hang of Git but since I learned 'git reset' life
is easier :)

>
> Whitespace:
> Some extra spaces crept in and are throwing off the diff in
> Structure.py. Also, Structure.remove_disordered_atoms has a bunch of
> blank lines in the function body; could you slim it down?
>

They were mostly in the biological_unit function I suppose. I cut those
down. I've also diff'ed the master and mine and checked differences in
whitespace. Almost all gone, those left I can't really erase them, they're
probably from my editor...

>
>
> Chain.renumber_residues:
> 1. How about 'res_init' and 'het_init'? When I see "seed" I think RNG.
>

Agreed, changed.

> 2. It looks like het_init does numbering automagically if the given
> value is 0, and otherwise numbers HETATMs in the expected way starting
> from the given number. How about letting "het_init=None" be the
> default, so "if not het_init" triggers the magical behavior, and if a
> positive integer is given then go from there. (Check that het_init >=
> 1 and maybe het_init == int(het_init) to allow 1.0 to be coerced to 1
> if you want.)
>

I'm using this later on to allow incremental chain renumbering. That's why
it's 0 and not False or none, because I then add a number to it and it
starts from that number on. I guess you understood when you read
Structure.py. I'll add a comment pointing this out.

> 3. I see in the last commit you added a local variable called 'magic'.
> <suspicious_look /> Could you find a better name for that? I think
> 'guess_by_ca' would fit, if I'm reading the code correctly.
>

Changed 'magic' to 'filter_by_ca'. What it's doing is filtering the HETATMs
if they have a CA atom so I guess it's a good name for it.

> 4. In the last block (lines 170-174 now), could you add a comment
> explaining why it would be reached? Before this commit there was a
> comment "Other HETATMs" but I'm not sure I fully understand. Is it for
> HETATMs not associated with any residue, i.e. not residue
> modifications?
>

Added. It's for all HETATMs that don't have a CA atom basically and that are
not contemplated in SEQRES (if there).

Structure.renumber_residues:
> 1. OK, I see what you're doing with het_seed=0 -- clever, but maybe
> more clever than necessary. It's not obvious from reading just this
> code that the first iteration is a special case for HETATM numbering;
> a maintainer would have to look at Chain.py too. A comment about that
> would help, I think.
>

Added.

> 2. Why (h/1000 >= 1) instead of (h >= 1000) ?
>

Accumulated frustration over one day results in such logical typos :)

> 3. If the Chain.renumber_residues arguments change to 'res_init' and
> 'het_init', then 'seed' here should change to 'init'
>

Done.

> 4. The arguments 'sequential' and 'chain_displace' seem to interact --
> I don't think I'd use chain_displace != 1 unless I had set
> sequential=True. So, it seems like chain_displace should only take
> effect if sequential=True (i.e. line 77 would be indented another
> level). To tighten things up further, I'd combine those two arguments
> into a single 'skip/gap_between_chains' or similar:
>
> # UNTESTED
> for chain in model:
>    r_new, h_new = chain.renumber_residues(res_seed=r, het_seed=h)
>    if skip_between_chains:
>        r = r_new + skip_between_chains
>    if h_new >= 1000:
>        # Each chain's HETATM numbering starts at the next even 1000*N
>        h = 1000*((h_new/1000)+1)
>    else:
>        h = h_new + 1
>
>
I changed it to consecutive_chains and refactored the code a bit. I also
changed the increment of the het_init value. This way, having more than 9
chains for example would lead to residue numbers over 10000 which is not
allowed. I solved it by making all HETATMs starting (by default) at 1000 and
just incrementing. If the numbering is consecutive they are also affected by
the value chosen to skip between chains. A bit more logical IMO.

Answering Kristian's suggestions:

- When renumbering a chain, the id's of some residues are changed. Have
> you tested whether the keys in Chain.child_dict are changed as well?
>

Good question... they didn't... is there an easy way of rebuilding that
dictionary? Or should I just "rebuild" it and then overwrite child_dict?

- Could you refactor a method Chain.change_residue_numbers(old_ids, new_ids)
> that does the changing of the calculated identifiers? I think this would
> have a some advantages (shorter code is more testable, easier to deal with
> the point above, I could use this for some custom numbering schemes).
>

Could you elaborate on this? Should it be a new method?

- Currently, Chain.renumber_residues in the lines
>      last_num = residue.id[1]+displace
>      residue.id = (residue.id[0], residue.id[1]+displace, residue.id[2])
>  are repating 3 times.

Changed. I merged the if-clauses. A bit more complicated but only one
if-else condition.

> Structure.build_biological_unit:
> It looks like if the structure has more than one model, the models
> after 0 will be clobbered when this method is run. So, unless a better
> solution appears, it's safest to add an assert for len(self.models) ==
> 1 or check+ValueError for multiple models.
>

I would prefer to return a new Structure object just with the Biological
Unit. It would save me the deepcopy but I'd have to create a new object so
dunno if I could gain some speed there. But this would actually make more
sense and avoid that problem. What do you think?

I pushed again to the same branch:

https://github.com/JoaoRodrigues/biopython/tree/pdb_enhancements