[Biopython] RFC on WIP - internal coordinates, structure / prediction folk please read

Thu Apr 11 00:32:33 UTC 2019

Hi Rob,

I had a look at the code and it seems like a LOT of work went into it.
Thanks a lot for the contribution!

I do have some concerns, however, mostly about duplication of code and
features. Without getting into a lot of detail, I think that, as it is, the
code is better suited to a standalone module than to being part of Bio.PDB.
I would expect something within Bio.PDB to make more use of the existing
classes (Structure, Residue, Chain, etc.), perhaps extending them (making
use of object-oriented inheritance), but your code seems to create entirely
new classes. As another example, the new modules you wrote include code to
read/write data, which seems to fit better in PDBIO.py (or another
xxxIO.py file). Likewise, a lot of the geometry calculations could be
packaged in the Vector.py module. Would it be unreasonable to ask you to
refactor the code to minimize this apparent redundancy? Maybe there
is a way to simplify the code by having Hedron and Dihedron classes (like
we have Residue, Chain, etc.) and an ICBuilder that takes a Structure
object and converts it to another Structure in IC "space". Something that
would allow you to do this in code:

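from Bio.PDB import PDBParser

# 'ic' stands for the proposed internal-coordinates module (it does not exist yet)
parser = PDBParser()
xyz_struct = parser.get_structure('1abc', '/home/foo/1abc.pdb')
ic_struct = ic.from_xyz(xyz_struct)  # returns a Structure
new_xyz = ic_struct.to_xyz()  # returns a Structure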

Would be happy to discuss these with you off the list, to minimize noise
for all other users. Otherwise, and depending on what others say, I would
be more in favor of including the code as a different module (e.g. Bio.PIC
or Bio.PDB.IC).

Despite seeming a bit negative, I am very interested in this code. I have
been toying with this sort of conversion for a very long time now but
never got round to implementing it myself because of the non-triviality of
converting xyz to ICs and back.

Cheers,

João

rob miller <rob.miller.gh at gmail.com> wrote on Tuesday, 9/04/2019 at
11:26:

> Hi João,
>
> Sorry, no publication and not much documentation :-)   The latter I can
> add; as for the former, it always seemed more like bookkeeping than
> science.  On the other hand, a publication might have raised awareness
> of the Lua version of this code that's been sitting on GitHub for 3 years
> now.
>
> The approach is as follows:
>
> - A hedron is a 3-atom unit defined by 2 bond lengths and the angle
> between them.  A dihedron is two hedra with one overlapping bond and a
> specified dihedral angle.
> - generate lists of all hedra and dihedra making up protein backbone and
> sidechains (like psi is N-CA-C-N+1, etc.).  Plan the lists so they overlap
> in the right way to extend atom chains for the assembly step.
> - read through the pdb structure and form the hedra and dihedra objects
> for each residue from the atom coordinates; residue i will include the
> hedra and dihedra which extend into residue i+1
> - at the beginning of the chain and any chain breaks (determined by
> peptide C-N bond exceeding a cut-off), store the X, Y, Z coordinates for
> the N, CA and C atoms to restart the chain.  We lose atoms in the case of
> chain fragments that don't start with N-CA-C.  (Storing these is what links
> the internal coordinates to the coordinate space of the PDB file)
> - at this point we have the internal coordinate set.  Re-writing the code
> to work with Biopython (this is at least my 3rd iteration since grad
> school) got me to better support hydrogens and altlocs, so the .pic
> datafile captures Biopython disordered residues/atoms including occupancy,
> plus B-factors.   The B-factors are definitely just tacked on at the end
> for each residue, but it does make for complete ATOM records when
> regenerating the PDB data.
> - the assembly process uses a deque (double-ended queue) loaded up with
> initial hedra containing atom coordinates. For all the dihedra that a
> hedron from the queue starts, work out the coordinates for the 4th atom,
> and put the 2nd hedron at the back of the queue. This way we work through
> the correctly ordered lists above; this is explained in more detail in the
> doc at the start of PIC_Residue.assemble().
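
The geometry in the list above can be sketched in plain Python. This is a
minimal illustration under stated assumptions, not the code in Rob's branch:
all function names (bond_angle, dihedral, place_fourth, to_internal,
assemble) are hypothetical, the six backbone coordinates are made up, and
altlocs, sidechains and chain breaks are ignored. It derives a length, angle
and dihedral for each 4-atom unit, then rebuilds the atoms from the first
three using a deque.

```python
import math
from collections import deque

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def unit(a):
    n = math.sqrt(dot(a, a))
    return (a[0] / n, a[1] / n, a[2] / n)

def bond_angle(a, b, c):
    """Angle a-b-c at atom b, in degrees."""
    cos_t = dot(unit(sub(a, b)), unit(sub(c, b)))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))

def dihedral(a, b, c, d):
    """Signed dihedral a-b-c-d in degrees (0 = cis)."""
    b1, b2, b3 = sub(b, a), sub(c, b), sub(d, c)
    n1, n2 = cross(b1, b2), cross(b2, b3)
    return math.degrees(math.atan2(dot(unit(b2), cross(n1, n2)), dot(n1, n2)))

def place_fourth(a, b, c, length, angle, torsion):
    """Position of d from a, b, c plus c-d length, b-c-d angle, a-b-c-d dihedral."""
    t, p = math.radians(angle), math.radians(torsion)
    bc = unit(sub(c, b))
    n = unit(cross(sub(b, a), bc))   # normal to the a-b-c plane
    m = cross(n, bc)                 # completes the local right-handed frame
    x = -length * math.cos(t)
    y = length * math.sin(t) * math.cos(p)
    z = length * math.sin(t) * math.sin(p)
    return tuple(c[i] + x * bc[i] + y * m[i] + z * n[i] for i in range(3))

def to_internal(coords, dihedra):
    """Map each 4-atom key to (c-d length, b-c-d angle, a-b-c-d dihedral)."""
    ic = {}
    for key in dihedra:
        a, b, c, d = (coords[k] for k in key)
        ic[key] = (math.sqrt(dot(sub(d, c), sub(d, c))),
                   bond_angle(b, c, d), dihedral(a, b, c, d))
    return ic

def assemble(start, ic):
    """Rebuild coordinates from three seed atoms plus internal coordinates."""
    coords = dict(start)
    starts = {}                      # first hedron -> dihedra it begins
    for key in ic:
        starts.setdefault(key[:3], []).append(key)
    queue = deque([tuple(start)])    # hedra whose three atoms are placed
    while queue:
        hedron = queue.popleft()
        for key in starts.get(hedron, []):
            a, b, c = (coords[k] for k in hedron)
            coords[key[3]] = place_fourth(a, b, c, *ic[key])
            queue.append(key[1:])    # the 2nd hedron goes to the back
    return coords

# Made-up coordinates for two consecutive residues' backbone atoms.
backbone = {
    "N": (0.0, 0.0, 0.0), "CA": (1.458, 0.0, 0.0), "C": (2.009, 1.420, 0.0),
    "N+1": (3.332, 1.536, 0.3), "CA+1": (3.988, 2.839, 0.5),
    "C+1": (5.480, 2.650, 0.9),
}
names = list(backbone)
dihedra = [tuple(names[i:i + 4]) for i in range(len(names) - 3)]
ic = to_internal(backbone, dihedra)
seed = {k: backbone[k] for k in names[:3]}   # stored N, CA, C restart the chain
rebuilt = assemble(seed, ic)
```

Only the three seed atoms carry Cartesian coordinates; every other atom in
`rebuilt` is recovered from the internal coordinates alone.
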
>
> The -hard- part is handling (or at least not crashing on) the altlocs,
> disordered atoms, etc.
>
> Rob.
>
>
> On Tue, Apr 9, 2019 at 5:39 PM João Rodrigues <
> j.p.g.l.m.rodrigues at gmail.com> wrote:
>
>> Hi Rob,
>>
>> This is an extremely interesting contribution. Thanks for the time you
>> took to write this. It would be very useful to have a conversion from
>> Cartesian to internal coordinates; as far as I know, there isn't a tool
>> that does that (they usually go one way or the other). What is the
>> convention you use for the conversion? How do you handle, say, chain breaks
>> or gaps? Is there a publication or small write-up you can provide about
>> this (maybe it's in the documentation and I missed it)?
>>
>> Cheers,
>>
>> João
>>
>> rob miller <rob.miller.gh at gmail.com> wrote on Tuesday, 9/04/2019
>> at 06:09:
>>
>>> Hi,
>>>
>>> This is a request for feedback on code I would like to contribute to
>>> Biopython.  I want to do more cleaning, testing, polishing and documenting
>>> before making a pull request, so don't worry yet :-).  This post is to
>>> query whether there's sufficient interest to accept the facility into
>>> Biopython when I'm ready, hopefully gather some positive feedback and
>>> ideas, and to make it publicly available now as it's working for me for
>>> most structures.
>>>
>>> This branch ( https://github.com/rob-miller/biopython/tree/rtm-pic )
>>> adds infrastructure for internal coordinates under a .pic attribute on
>>> Bio.PDB Chain and Residue objects.  'Internal coordinates' means phi, psi,
>>> omega, chi<X> dihedral angles, all bond angles and bond lengths.  Internal
>>> coordinates can be read from a PDB structure and used to regenerate
>>> identical-coordinate PDB chains (HETATMs notwithstanding, although there
>>> is some support).
>>>
>>> While my primary application is to support structure prediction work,
>>> there are some useful side effects.  Probably most interesting is the
>>> ability to generate OpenSCAD files to 3D print protein structure models, as
>>> it uses the same algorithm for assembly of bond length, angle and dihedral
>>> angle data.  (Please be aware that the initial OpenSCAD rendering is
>>> reasonably quick, but the detailed rendering to generate an .stl file for
>>> printing can take hours depending on your hardware.)  Of lesser note,
>>> filtering options add support for removing hydrogens from PDB structures,
>>> and obviously one can make Ramachandran plots and database projects looking
>>> at different subsets of chi rotamers and other aspects of protein
>>> structure.
>>>
>>> I've made a gist at
>>> https://gist.github.com/rob-miller/0be208b73fe2ab36fadeeef60831fc92
>>> to access the basic functionality.  Hopefully this is easy to get
>>> working in your hands; if you have a local PDB mirror, there is a place to
>>> configure access to it near the beginning of the script.
>>>
>>> If you are playing with OpenSCAD and your protein has chain breaks (or
>>> you excised lysozyme from a GPCR), increase the -maxp cutoff in the gist
>>> options to treat the gap as an extra long peptide bond.
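
The chain-break rule mentioned here (a peptide C-N distance above a cutoff)
can be illustrated with a short sketch. The function name split_at_breaks,
the dict-of-atoms residue layout, the coordinates, and the 1.4 angstrom
default are all assumptions for illustration; this is not the gist's actual
-maxp implementation, and its default value is not stated in this thread.

```python
import math

def split_at_breaks(residues, max_peptide=1.4):
    """Split residues into segments wherever C(i)-N(i+1) exceeds max_peptide.

    Each residue is a dict mapping atom name to an (x, y, z) tuple; the
    cutoff is in angstroms and its default here is an assumed value.
    """
    segments, current = [], []
    for res in residues:
        if current and math.dist(current[-1]["C"], res["N"]) > max_peptide:
            segments.append(current)   # gap too long: close this segment
            current = []
        current.append(res)
    if current:
        segments.append(current)
    return segments

# Made-up coordinates: residues 1 and 2 are bonded, residue 3 follows a gap.
residues = [
    {"N": (0.0, 0.0, 0.0), "C": (2.0, 1.4, 0.0)},
    {"N": (3.3, 1.5, 0.3), "C": (5.5, 2.6, 0.9)},
    {"N": (9.5, 2.6, 0.9), "C": (11.6, 3.9, 1.0)},
]
strict = split_at_breaks(residues)                   # gap splits the chain
relaxed = split_at_breaks(residues, max_peptide=5.0) # gap treated as a long bond
```

Raising the cutoff, as suggested for OpenSCAD output, keeps the whole chain
in one segment so the gap is bridged like an extra long peptide bond.
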
>>>
>>> All development so far has been exclusively on Python 3, so yes, there
>>> are more versions to support.  I am aware of the related projects
>>> FragBuilder and PeptideBuilder.
>>>
>>> I hope you like it; please be gentle with me.
>>>
>>> Rob.
>>>
>>> _______________________________________________
>>> Biopython mailing list  -  Biopython at mailman.open-bio.org
>>> https://mailman.open-bio.org/mailman/listinfo/biopython