[Biopython-dev] Optimization of PDBParser and friends

João Rodrigues anaryin at gmail.com
Mon Sep 3 22:07:39 UTC 2012


Hi all,

A quick update on some recent work. I finally found some time to work a bit
on the PDB parser and Bio.PDB in general. I started by optimizing the
current code. I ran cProfile on a script that parsed a set of structures
without header and without element columns. I did this because one of the
optimizations rendered the current header parser useless (I replaced the
readlines call on the PDB file handle with an iterator). I still need to
work a bit on the memory leak, but for now it seems pretty OK (parsed
400-ish large structures without a glitch).
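
For anyone who wants to try this kind of measurement, here is a minimal
sketch using only the standard library. It is not the script I ran: the
PDB_TEXT sample, the two count_atoms_* functions and profile_call are
hypothetical stand-ins, and the real parser does far more per line. It only
shows the difference between readlines and iterating over the handle, and
how to run a call under cProfile. The last lines show that a handle
consumed by iteration has nothing left for a second pass, which is my
assumption about why a parser that expects to re-read the lines would
break.

```python
import cProfile
import io
import pstats

# Hypothetical stand-in for a PDB file: two ATOM records and an END line.
PDB_TEXT = (
    "ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67\n"
    "ATOM      2  CA  MET A   1      26.266  25.413   2.842  1.00 10.38\n"
    "END\n"
)


def count_atoms_readlines(handle):
    """Load every line into a list first, then scan the list."""
    lines = handle.readlines()
    return sum(1 for line in lines if line.startswith("ATOM"))


def count_atoms_iter(handle):
    """Consume the handle lazily, one line at a time."""
    return sum(1 for line in handle if line.startswith("ATOM"))


def profile_call(func, *args):
    """Run func under cProfile; return its result and a pstats.Stats."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    return result, pstats.Stats(profiler)


handle = io.StringIO(PDB_TEXT)
n_atoms, stats = profile_call(count_atoms_iter, handle)

# Show the five most expensive entries by cumulative time.
stats.sort_stats("cumulative").print_stats(5)

# The handle is now exhausted: a second pass would see no lines at all.
leftover = handle.readline()
```

With a real file you would pass an open file object instead of the
io.StringIO wrapper; the rest stays the same.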

I am attaching two pictures of the cProfile results and the two output
files. There is a nice improvement of about 25%, but this can surely be
improved further. I just replaced some methods here and there,
pre-initialized the numpy arrays, etc. I pushed this version to my github
pdb_enhancements branch
<https://github.com/JoaoRodrigues/biopython/tree/pdb_enhancements>.

One big change I would propose is to eliminate the child_list/child_dict
duality. I think that keeping child_dict and generating child_list from
the sorted dict keys would be good enough. OrderedDict also looks
appropriate, but it's Py2.7+ only. I still need to look into this, but all
those "append" calls in the profiling hint at a nice speed-up, and also at
much cleaner code.
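
To make the idea concrete, here is a rough sketch of a container that
stores its children only in child_dict and builds child_list on demand from
the sorted keys. The Entity and Child classes below are hypothetical
illustrations written for this message, not the current Bio.PDB code, and
the integer ids are just an assumption to keep the example short.

```python
class Child:
    """Hypothetical child object identified by a sortable id."""

    def __init__(self, id):
        self.id = id


class Entity:
    """Hypothetical container that keeps children in a dict only."""

    def __init__(self, id):
        self.id = id
        self.child_dict = {}

    def add(self, child):
        """Store the child under its id; no parallel list to append to."""
        if child.id in self.child_dict:
            raise ValueError("%r defined twice" % (child.id,))
        self.child_dict[child.id] = child

    @property
    def child_list(self):
        """Build the list on demand from the sorted dict keys."""
        return [self.child_dict[key] for key in sorted(self.child_dict)]


entity = Entity("A")
for child_id in (3, 1, 2):
    entity.add(Child(child_id))

ordered_ids = [child.id for child in entity.child_list]
```

One thing to keep in mind: children come back in key order, not in the
order they were added, and the list is rebuilt on every access. If the
insertion order matters, collections.OrderedDict would preserve it, at the
cost of requiring Python 2.7 or later.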

Let me know your opinion if you have some time,

Cheers,

João

PS. Attached complex_1.pdb as an example of the structures in the dataset
used for this particular test.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: BioPDB-master-TBEV.png
Type: image/png
Size: 166144 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biopython-dev/attachments/20120904/51a8c7b6/attachment-0004.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: BioPDB-master-TBEV.profile
Type: application/octet-stream
Size: 252112 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biopython-dev/attachments/20120904/51a8c7b6/attachment-0004.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: BioPDB-optimized-TBEV.png
Type: image/png
Size: 148137 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biopython-dev/attachments/20120904/51a8c7b6/attachment-0005.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: BioPDB-optimized-TBEV.profile
Type: application/octet-stream
Size: 273487 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biopython-dev/attachments/20120904/51a8c7b6/attachment-0005.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: complex_1w.pdb
Type: chemical/x-pdb
Size: 649559 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biopython-dev/attachments/20120904/51a8c7b6/attachment.bin>

