[Biopython-dev] Optimization of PDBParser and friends

Wed Sep 5 20:24:23 UTC 2012

Hello all,

Some news.

A. The OrderedDict implementation is quite slow. It slows the parser down
by 30%, essentially cancelling out all the improvements I had made.
Although it is a great idea, speed is a major reason for these updates, so
I think it might not be worth it.

B. As an alternative, I implemented the following. Entity now has only
child_dict, which is a plain dictionary. Each object (Model, Chain,
Residue, Atom) gets its own __cmp__ method, built from the information in
the "_sort" methods that already existed. This way, simply sorting the
values of the dictionary returns an ordered list. I tweaked Atom.__cmp__
to sort the N, CA, C, O atoms first and the rest alphabetically. I also
made inorganic atoms such as calcium come at the end, which makes things a
bit nicer when calcium is involved, for example. The only downside seems
to be that we lose the order in which residues are inserted. I.e., if
residue 151 is the first in the PDB file and all the others range from 1
to 150, then residue 151 is going to be placed at the end when you
iterate. However, in my experience and in my opinion, not only is this
logical, but such a case also rarely happens in real PDB files.
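
A rough sketch of the ordering described in B, written for Python 3, where
__cmp__ is no longer used, so __lt__ stands in for it here. SketchAtom,
BACKBONE_ORDER and INORGANIC_ELEMENTS are made-up names for illustration,
the set of inorganic elements is an assumption, and the dictionary is keyed
by serial number only to keep the example small. This is not the actual
Bio.PDB code:

```python
# Assumed ordering tables, for illustration only.
BACKBONE_ORDER = {"N": 0, "CA": 1, "C": 2, "O": 3}
INORGANIC_ELEMENTS = {"CA", "MG", "ZN", "FE", "NA", "K"}


class SketchAtom:
    """Hypothetical stand-in for an atom; not the Bio.PDB Atom class."""

    def __init__(self, name, element):
        self.name = name
        self.element = element

    def sort_key(self):
        # Inorganic atoms (e.g. calcium, element "CA") go last.
        if self.element in INORGANIC_ELEMENTS:
            return (2, 0, self.name)
        # Backbone atoms come first, in N, CA, C, O order.
        if self.name in BACKBONE_ORDER:
            return (0, BACKBONE_ORDER[self.name], self.name)
        # Everything else is sorted alphabetically by name.
        return (1, 0, self.name)

    def __lt__(self, other):
        # sorted() only needs "<" to order the objects.
        return self.sort_key() < other.sort_key()


# A plain dict: insertion order carries no meaning for the result.
child_dict = {
    7: SketchAtom("CA", "CA"),  # calcium ion
    5: SketchAtom("OG", "O"),
    4: SketchAtom("O", "O"),
    2: SketchAtom("CA", "C"),   # alpha carbon
    6: SketchAtom("CB", "C"),
    1: SketchAtom("N", "N"),
    3: SketchAtom("C", "C"),
}

ordered_names = [atom.name for atom in sorted(child_dict.values())]
print(ordered_names)
```

Note that the alpha carbon and the calcium ion share the name "CA"; the
sketch tells them apart by element.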

C. I am strongly in favour of removing most (if not all) set/get methods
and replacing them with direct attribute access. For instance,
"atom.get_parent() --> atom.parent". This saves some space in the code and
makes things more transparent.
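
To make the comparison in C concrete, here is a small sketch of the two
styles side by side. SketchEntity is a hypothetical class invented for this
example, not the Bio.PDB Entity:

```python
class SketchEntity:
    """Hypothetical entity with a parent link; for illustration only."""

    def __init__(self, id, parent=None):
        self.id = id
        self.parent = parent  # plain attribute, read directly

    def get_parent(self):
        # Getter style, shown only for comparison with attribute access.
        return self.parent


residue = SketchEntity("GLY 1")
atom = SketchEntity("CA", parent=residue)

# Both spellings return the same object; the attribute form is shorter.
print(atom.get_parent().id)
print(atom.parent.id)
```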

D. I edited the PDBParser to tweak a few things, nothing major. The file
handle is now treated as an iterator throughout the parsing, which should
be more memory-friendly. The line counter is still preserved. I also added
a test to make the get_header argument actually work.
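
The general idea of iterating over the handle while keeping a line counter
can be sketched as follows. iter_atom_lines is a hypothetical helper, and
the PDB lines are abbreviated placeholders rather than valid records. This
is not the PDBParser code itself:

```python
import io


def iter_atom_lines(handle):
    """Yield (line_number, line) for ATOM/HETATM lines, one at a time."""
    # enumerate() keeps the line counter without reading the whole file
    # into memory first.
    for line_number, line in enumerate(handle, start=1):
        if line.startswith(("ATOM", "HETATM")):
            yield line_number, line.rstrip("\n")


# Abbreviated, made-up content standing in for a PDB file.
pdb_text = (
    "HEADER    HYPOTHETICAL EXAMPLE\n"
    "ATOM      1  N   GLY A   1\n"
    "ATOM      2  CA  GLY A   1\n"
    "END\n"
)

records = list(iter_atom_lines(io.StringIO(pdb_text)))
for line_number, line in records:
    print(line_number, line)
```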

E. General things here and there that I just can't remember.

F. Unittests are breaking everywhere. Checking why, but it all seems
related to this sorting issue.

Cheers,

João