[Biopython] Overhauling of Bio.PDB module

Thu Oct 17 07:04:19 UTC 2019

Hi

It’s great to hear that you are updating Biopython’s PDB module.

Just a reminder: the wwPDB considers PDB files a legacy format; the primary format is mmCIF. An increasing number of PDB entries do not have a PDB format file at all. So, if you are fetching files from the wwPDB FTP, you should be getting the mmCIF format file.
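As a rough illustration of fetching the mmCIF file rather than the PDB one, here is a minimal sketch that builds the download address for an entry. The function name mmcif_url, the default host and the "divided" directory layout are assumptions for illustration, not something stated in this thread, so check them against the current wwPDB archive documentation before relying on them.

```python
def mmcif_url(pdb_id, host="https://files.wwpdb.org"):
    """Build the address of the gzipped mmCIF file for a 4-character PDB ID.

    Hypothetical helper: the host and the 'divided' layout (sub-directory
    named after the two middle characters of the ID) are assumptions.
    """
    pdb_id = pdb_id.lower()
    if len(pdb_id) != 4 or not pdb_id.isalnum():
        raise ValueError(f"expected a 4-character PDB ID, got {pdb_id!r}")
    middle = pdb_id[1:3]
    return f"{host}/pub/pdb/data/structures/divided/mmCIF/{middle}/{pdb_id}.cif.gz"
```

The returned address can then be passed to whatever download code is already in use; nothing here touches the network.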

Also, MMTF will be replaced by binary CIF in the not too distant future:

https://github.com/dsehnal/BinaryCIF

Binary CIF will be used by Mol* (https://molstar.org/), the new viewer from RCSB and PDBe, and will be served by both RCSB’s and PDBe’s coordinate servers:

https://www.ebi.ac.uk/pdbe/coordinates/index.html

Regards

John

From: Biopython <biopython-bounces+jmb=ebi.ac.uk at mailman.open-bio.org> On Behalf Of Joe Greener
Sent: 16 October 2019 23:23
To: biopython at biopython.org
Subject: Re: [Biopython] Overhauling of Bio.PDB module

Hi João,

I hadn't seen your reply when I wrote mine (spam filters, grr) but it appears we are broadly in agreement.

I agree that Bio.PDB's USP is its general parsing and structure handling functionality. I guess there is a "build it and they will come" argument for making the spatial stuff fast too.

In the long term, Bio.Structure is probably a better name anyway, as we now parse mmCIF and MMTF as well as PDB files. And it would allow us to sort out the unholy mess of imports and module/class name clashes that Bio.PDB has accumulated over the years.

Best,
Joe

Joe Greener
Research Associate, UCL
http://jgreener64.github.io

On 16/10/2019 18:14, João Rodrigues wrote:

Hi Joe,

IIRC from BOSC, my proposal was to work under a new namespace 'Bio.Structure' to avoid compatibility issues and, in the long term, deprecate Bio.PDB once all functionality had been rewritten.

It would also be interesting to gauge which features people (users and developers) would like to see implemented/changed/fixed/removed.

The old car analogy is perfect :)

Cheers,

Joao

Joe Greener <jgreener at hotmail.co.uk> wrote on Wednesday, 16/10/2019 at 15:08:

Hi Patrick,

Some of us spoke about this at CoFest too, inspired by the ideas in Biotite (I don't think you and I spoke at BOSC though). As I recall it was João, Spencer, myself and possibly Peter in the discussions.

We were in favour of the fundamental idea of a large coordinate array that is indexed into. As you point out though it would be no small amount of work to implement. I personally won't have time to do it, though I am happy to discuss and review code.

I view Bio.PDB like a beloved older car that has been patched up over many years. It is probably the most widely used and debugged PDB parsing code around, and any overhaul would have to make sure to maintain the behaviour that many people rely on. That said, it does have its peculiarities and is rather slow (https://github.com/jgreener64/pdb-benchmarks). I'm just saying that we should make sure to get consensus before merging any overhaul PRs. But for sure I am in favour of someone making those PRs.

Best,
Joe

Joe Greener
Research Associate, UCL
http://jgreener64.github.io

On 16/10/2019 12:37, Patrick Kunzmann wrote:

Hello Biopythoneers, 

At BOSC this year we talked about overhauling the Bio.PDB module. The problem is that the atom coordinates are currently stored in a separate NumPy array for each atom. This design prevents efficient computation of all kinds of analyses (distances, angles, superimpositions, etc.). One solution we discussed was to put the coordinates of the entire structure in one NumPy array and let the Atom, Residue, Chain and Structure objects point to positions in this array. The benefit of this approach is that functions could be applied directly to the entire array, harnessing the power of vectorization.
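To make the idea concrete, here is a minimal sketch of the shared-array layout, assuming a simplified Atom class that is purely hypothetical and not the existing Bio.PDB API. Each atom keeps only an index into one (n_atoms, 3) array, so whole-structure operations work on the array directly.

```python
import numpy as np


class Atom:
    """Hypothetical atom that stores an index into a shared coordinate array."""

    def __init__(self, name, coords, index):
        self.name = name
        self._coords = coords  # shared (n_atoms, 3) array, not a copy
        self._index = index

    @property
    def coord(self):
        # Integer indexing of a row returns a view, so in-place edits
        # made through it are written to the shared array.
        return self._coords[self._index]


# One array for the whole structure instead of one small array per atom
coords = np.array(
    [[0.0, 0.0, 0.0],
     [3.0, 4.0, 0.0],
     [0.0, 0.0, 2.0]]
)
atoms = [Atom(name, coords, i) for i, name in enumerate(["N", "CA", "C"])]

# Vectorized: translate every atom at once (in place, so atoms stay in sync)
coords += np.array([1.0, 0.0, 0.0])

# Vectorized: all pairwise distances via broadcasting
diff = coords[:, np.newaxis, :] - coords[np.newaxis, :, :]
distances = np.sqrt((diff ** 2).sum(axis=-1))
```

The object hierarchy stays available for convenient access (atoms[1].coord), while distance or superimposition code only ever sees the array. Note that the array must be modified in place; rebinding coords to a new array would leave the atoms pointing at the old one.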

For the analysis we could adapt the vectorized functions from the Python package Biotite, a project I am currently working on (https://www.biotite-python.org/apidoc/biotite.structure.html). Usually, these functions already accept the coordinates as a NumPy array, so I think only a few tweaks would be necessary for each function.

However, we would require one person or a small team to make the effort to implement the new structure types and adapt the analysis functions. I could offer a pair of helping hands in the adaptation of the analysis functions, but I don't have the time for anything more.

So the question is: is there anyone out there who is willing to do this work? Alternatively, I would propose writing a 'bridge' package between Biopython and Biotite that converts the Biopython structure representation into the Biotite representation and vice versa. I think this solution is less elegant but would also require less effort.
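The core of such a bridge could look roughly like the sketch below. It is only an illustration: the helper names to_arrays and write_back are hypothetical, and the code relies on nothing more than each atom object having .name and .coord attributes (which Bio.PDB Atom objects do). Producing an actual Biotite structure from the arrays is left out.

```python
import numpy as np


def to_arrays(atoms):
    """Collect per-atom objects into parallel arrays (hypothetical helper).

    Each item only needs `.name` and `.coord` attributes.
    """
    atoms = list(atoms)
    names = np.array([atom.name for atom in atoms])
    coords = np.array([atom.coord for atom in atoms], dtype=float)
    return names, coords.reshape(-1, 3)


def write_back(atoms, coords):
    """Copy rows of an (n_atoms, 3) array back onto the per-atom objects."""
    for atom, row in zip(atoms, coords):
        atom.coord = np.array(row, dtype=float)
```

With a Bio.PDB structure, to_arrays(structure.get_atoms()) would give the arrays to analyse, and write_back would push any transformed coordinates back to the atoms. Both directions copy the data, which is the price of the bridge approach compared with a shared array.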

Best regards 

Patrick Kunzmann 

_______________________________________________ 
Biopython mailing list  -  Biopython at mailman.open-bio.org <mailto:Biopython at mailman.open-bio.org>  
https://mailman.open-bio.org/mailman/listinfo/biopython 
