[Biopython-dev] Future of Bio.PDB

Wed Feb 19 16:22:54 UTC 2014

On Wed, Feb 19, 2014 at 6:54 AM, João Rodrigues <anaryin at gmail.com> wrote:

> From another thread:
>
> > As for what I suggested: since my GSoC period, already 4 years ago, I
> > noticed that the PDB module is a bit messy in terms of organization. The
> > module itself is named after the databank, which can be confused with the
> > format name; the mmCIF parser is defined in a subfolder; and there are
> > application wrappers there too (DSSP, NACCESS). Besides this issue, which
> > is not an issue at all and just my own pet peeve, there is a lot that the
> > entire module could gain from a thorough revision. I've been using it
> > very often and some normal manipulations of structures are not
> > straightforward to carry out (calculating a center of mass, for example,
> > or removing double occupancies) due to the parser being slow and quite
> > memory hungry. In fact, trying to run the parser on a very large
> > collection of structures often results in a random crash due to memory
> > issues.
> > I've been toying with a lot of changes, performance improvements, etc.,
> > but I'm not satisfied at all with them. Some things I've been trying are
> > to have the structure coordinates defined as a full NumPy array instead
> > of N arrays per structure (one per atom), and to use __slots__ to
> > mitigate memory usage (I managed to get it down 33% this way). This would
> > also go in line with a suggestion from Eric a long time ago to make a
> > Bio.Struct module, which would be the perfect "playground" to implement
> > and test these changes. Other developments that I think are worth looking
> > into are, for example, making a nice library to link a parsed structure
> > to the PDB database and fetch information on it using the REST services
> > they provide.
> > I'd like to hear your opinion (as in, everybody, developers and users) on
> > this and if it makes sense to indeed give a bit of TLC to the Bio.PDB
> > module. Also, on what changes you think should be carried out to improve
> > the module, like which features are missing, which applications are worth
> > wrapping.
> > Just to kick off some discussion. Maybe a new thread should be opened for
> > this later on.
> > Cheers,
> > João
>
>
> As for the name of the module, yes, Bio.Struct is just the "legacy" name I
> remember. Bio.structure would probably be better and clearer.
>

The p3d folks once offered to incorporate their work into Biopython:
http://www.biomedcentral.com/1471-2105/10/258

We had concerns about having p3d and Bio.PDB coexisting within Biopython,
but if someone wanted to emulate the Bio.PDB API on top of p3d, or
otherwise slip p3d's secret sauce into the Bio.PDB internals, that would do
the trick. (I have not thought about the details of how this would work at
all.) I think it should also be possible to replace p3d's custom query
language with the sort of tricks Bio.Phylo, pandas and SQLAlchemy do with
keyword arguments and generators to get the same results with Python
syntax.
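
To make the keyword-argument idea concrete, here is a rough sketch of what
such a query could look like. Everything in it is hypothetical: AtomRecord
and find_atoms are made-up names for illustration, not part of Bio.PDB or
p3d. Each keyword is matched against an attribute, a callable value is
treated as a predicate, and the function is a generator so nothing is
materialized until the caller asks for it.

```python
from collections import namedtuple

# Hypothetical stand-in for an atom record; this is not the Bio.PDB Atom class.
AtomRecord = namedtuple("AtomRecord", "name resname chain resseq element")


def find_atoms(atoms, **criteria):
    """Lazily yield atoms whose attributes satisfy every keyword criterion.

    A criterion may be a plain value (compared with ==) or a callable
    predicate that receives the attribute value.
    """
    for atom in atoms:
        for attr, wanted in criteria.items():
            value = getattr(atom, attr)
            if callable(wanted):
                if not wanted(value):
                    break
            elif value != wanted:
                break
        else:
            # No criterion failed, so this atom matches.
            yield atom


atoms = [
    AtomRecord("N", "ALA", "A", 1, "N"),
    AtomRecord("CA", "ALA", "A", 1, "C"),
    AtomRecord("CA", "GLY", "A", 2, "C"),
    AtomRecord("CA", "GLY", "B", 1, "C"),
]

# Alpha carbons in chain A, written as ordinary keyword arguments.
ca_in_chain_a = list(find_atoms(atoms, name="CA", chain="A"))

# A predicate takes the place of a comparison operator in a query string.
late_residues = list(find_atoms(atoms, resseq=lambda n: n >= 2))
```

Because find_atoms is a generator, calls can be chained (the output of one
call fed as the atoms argument of the next) without building intermediate
lists.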

Alternatively, there is the option of sticking with the Bio.PDB namespace
and adding only "read", "write" and "convert" functions to
Bio/PDB/__init__.py to make the basic usage of the module more similar to
the other Biopython sub-packages. The Model class could store one or
several NumPy arrays that cover all atom coordinates, and the Chain,
Residue, Atom and Interface classes would probably just store references to
that array, e.g. a shorter 1D array of integer row indexes.
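
A minimal sketch of that layout, under stated assumptions: the Model and
Residue classes below are hypothetical illustrations, not the existing
Bio.PDB classes, and the masses array is added only so that a center of
mass can be computed. The model owns one (N, 3) coordinate array and each
residue keeps only a 1D array of integer row indexes into it.

```python
import numpy as np


class Model:
    """Hypothetical container: one (N, 3) array holds every atom coordinate."""

    def __init__(self, coords, masses):
        self.coords = np.asarray(coords, dtype=float)
        self.masses = np.asarray(masses, dtype=float)


class Residue:
    """Hypothetical view that stores only row indexes into the model arrays."""

    __slots__ = ("model", "name", "rows")

    def __init__(self, model, name, rows):
        self.model = model
        self.name = name
        self.rows = np.asarray(rows, dtype=np.intp)

    @property
    def coords(self):
        # Integer-array indexing returns a copy of the selected rows.
        return self.model.coords[self.rows]

    def center_of_mass(self):
        m = self.model.masses[self.rows]
        return (self.coords * m[:, None]).sum(axis=0) / m.sum()

    def translate(self, vector):
        # Indexed in-place addition writes back into the shared array.
        self.model.coords[self.rows] += np.asarray(vector, dtype=float)


model = Model(
    coords=[[0, 0, 0], [2, 0, 0], [0, 2, 0], [10, 10, 10]],
    masses=[12, 12, 12, 16],
)
ala = Residue(model, "ALA", rows=[0, 1, 2])

com = ala.center_of_mass()
ala.translate([1.0, 0.0, 0.0])
```

With this arrangement a whole-structure operation is a single array
expression on model.coords, while per-residue operations touch only the
rows that residue refers to. Using __slots__ on the lightweight view
classes avoids a per-instance __dict__.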

Would either of these internal changes make it easier to apply the GSoC
work that's been done on Bio.PDB?

-Eric