[Biopython-dev] Optimization of PDBParser and friends

Thu Sep 6 05:52:34 UTC 2012

Hey,

> Which Python was that? i.e. The OrderedDict from the standard lib
> (which I hope is optimised), or the back port (which might be slower).
>

Both. I also found it strange and googled it
<http://stackoverflow.com/questions/8176513/ordereddict-performance-compared-to-deque>.
Apparently OrderedDict is pure Python, not C like dict, hence the
difference.

> That seems risky - but see if you can sort out what is happening
> with the unit tests (below).
>

What Bio.PDB does right now is rely on a list to iterate over things, so
you get the order in which the PDB file was read. However, if you sort
using the sort methods of the various objects, you get the following
rules:

Atom.py - N CA C O first, then alphabetically
Residue.py - Amino acids and nucleic acids first, then heteroatoms.
Chain.py - Empty chains last.
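
To make the Atom.py rule concrete, here is a minimal sketch of it as a sort
key. The names BACKBONE_ORDER and atom_sort_key are made up for illustration
and are not the actual Bio.PDB code:

```python
# Hypothetical helper, not taken from Bio.PDB: backbone atoms first,
# then every other atom name alphabetically.
BACKBONE_ORDER = {"N": 0, "CA": 1, "C": 2, "O": 3}


def atom_sort_key(name):
    """Return a key that puts N, CA, C, O first and sorts the rest by name."""
    rank = BACKBONE_ORDER.get(name, len(BACKBONE_ORDER))
    return (rank, name)


names = ["CB", "O", "CG", "N", "C", "CA"]
sorted_names = sorted(names, key=atom_sort_key)
print(sorted_names)  # ['N', 'CA', 'C', 'O', 'CB', 'CG']
```
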

These are already in place somewhere in the code. I just used them to
overload the __cmp__ method, with a couple of additions because I
personally disagree with the following:

Atom.py - Inorganic atoms should come out last. For simplicity.
Residue.py - If the PDB order is 151 MSE, 152 VAL, 153 CYS, iterating
should return 151, 152, 153. Right now you get 152, 153, 151.
PDB files already use weird large numbers for water and ions, for example,
so these come out last anyway. Pushing all HETATMs to the end will
sometimes disrupt the "natural" order of things, for instance with modified
residues. Magic perhaps :)
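
The difference between the two residue orderings can be sketched as follows.
The ids follow the Bio.PDB convention of (hetero flag, sequence number,
insertion code); the two key functions are hypothetical and only illustrate
the rules, they are not the code in my branch:

```python
# Residue ids in the Bio.PDB style: (hetero flag, sequence number, insertion code).
residue_ids = [
    ("H_MSE", 151, " "),
    (" ", 152, " "),
    (" ", 153, " "),
]


def hetero_last_key(res_id):
    """Existing rule as described above: HETATM residues after standard ones."""
    hetflag, resseq, icode = res_id
    return (hetflag != " ", resseq, icode)


def sequence_key(res_id):
    """Proposed rule: order by sequence number and insertion code only."""
    hetflag, resseq, icode = res_id
    return (resseq, icode)


hetero_last = [r[1] for r in sorted(residue_ids, key=hetero_last_key)]
by_sequence = [r[1] for r in sorted(residue_ids, key=sequence_key)]
print(hetero_last)  # [152, 153, 151]
print(by_sequence)  # [151, 152, 153]
```
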

I sorted out all relevant issues with the unit tests. I had a small problem
with build_peptides because of this HETATM-last rule, so I took the rule out
and now it works. All tests pass except 4: 2 because of the header, which is
not parsed properly right now, and 2 because the ordering is explicit in
the assert statements of the tests. So it's a matter of changing
these assertions and they will pass.

> It would also look less like Java code ;)
>
> I like this plan - but initially define and document the new properties,
> and deprecate the old get/set properties. Without that you'll break
> almost every PDB using script out there.
>

How do I deprecate the old ones? Is there a DeprecationWarning or so?
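
My guess would be something along these lines, assuming the standard
warnings module is the right tool. The Atom class here is a toy stand-in
just to show the pattern, and Biopython may well prefer its own warning
class over the built-in DeprecationWarning:

```python
import warnings


class Atom(object):
    """Toy stand-in for Bio.PDB.Atom, only to illustrate the pattern."""

    def __init__(self, name):
        self._name = name

    @property
    def name(self):
        """New-style attribute access."""
        return self._name

    def get_name(self):
        """Old-style getter, kept working but flagged as deprecated."""
        warnings.warn(
            "get_name() is deprecated; use the name property instead",
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller, not this method
        )
        return self.name


# Record warnings so the effect is visible even where they are hidden by default.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    atom = Atom("CA")
    old_value = atom.get_name()

print(old_value, [w.category.__name__ for w in caught])
```

Old scripts calling get_name() would keep working and only see a warning.
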

Just a reminder, if you want to test/check the code, it's on my GitHub:
<https://github.com/JoaoRodrigues/biopython/tree/pdb_enhancements>

Cheers,

João