Bioperl: 3D biomolecular structure handling for Bioperl
Andrew Dalke
dalke@bioreason.com
Fri, 18 Dec 1998 23:41:00 -0800
Lincoln <lstein@cshl.org> commented:
> I strongly support the idea of using XML for this and other elements
> of Bioperl. I wonder how well the CML (Chemical Markup Language)
> already fits these requirements?
I took a look at it after Peter Murray-Rust mentioned it as part of
a discussion of mmCIF on the PDB mailing list. I sent some comments
to him but never received a reply. Following is a somewhat modified
and edited version of those comments. BTW, more information about
CML is at:
http://www.venus.co.uk/omf/cml/doc/index.html
= = =
I took a closer look at the XML work you've been doing and I've a set
of comments on it. They're mostly jotted down as I was going through
the pages and probably show my lack of knowledge of XML more than anything
else. I hope they don't sound too brusque.
From the FAQ:
> My first approach has been for a molecule to contain references to
> its oligomeric components, along with details of the covalent
> linkages and stereochemistry.
I disagree, but that's my background showing through. I've found the
base data type of atoms with bonds connecting them to be the most
useful, along with some sets (a residue is a set of atoms, a sequence
is an ordered set of residues) to help organize them. This is useful
for me since what I've done (molecular dynamics, structure analysis)
needed an atom-centric viewpoint. Nature doesn't care if it's an
ASP-GLY dipeptide; it's a set of atoms.
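A minimal Python sketch of that atom-centric model, assuming nothing
about CML or Bioperl. The class, field and method names are made up for
illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Atom:
    index: int            # 0-based position in the molecule's atom list
    element: str          # element symbol, or any other label (e.g. a dummy atom name)
    charge: float = 0.0   # may hold a non-integral ("real") charge


@dataclass
class Molecule:
    atoms: list = field(default_factory=list)
    bonds: set = field(default_factory=set)        # frozenset({i, j}) of atom indices
    residues: dict = field(default_factory=dict)   # residue name -> set of atom indices
    sequence: list = field(default_factory=list)   # residue names, in order

    def add_atom(self, element, charge=0.0):
        atom = Atom(len(self.atoms), element, charge)
        self.atoms.append(atom)
        return atom.index

    def add_bond(self, i, j):
        if i == j:
            raise ValueError("an atom cannot be bonded to itself")
        self.bonds.add(frozenset((i, j)))

    def add_residue(self, name, atom_indices):
        # A residue is only a named set of atoms; the sequence orders the residues.
        self.residues[name] = set(atom_indices)
        self.sequence.append(name)
```

Atoms and bonds are the primary data here. Residues and the sequence
are just groupings layered on top of them.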
> If ATOMNO is NOT given, the atoms are assumed to be numbered from
> 1...NATOMS in their occurrence in the ARRAY container.
I don't understand XML well enough, but can some atoms have an ATOMNO and
others not? What happens if there are duplicates? What does a
ZMATNOS value of "12 12 12" mean if there are three ATOMNOs of 12?
So shouldn't this be marked as being unique?
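One possible policy, sketched in Python. Rejecting mixed and duplicate
ATOMNOs is my assumption, not something the CML documentation states,
and the function name is hypothetical.

```python
from collections import Counter


def resolve_atom_numbers(atomnos):
    """Return one atom number per atom, given ATOMNO values in ARRAY order.

    `atomnos` holds an int for each atom that has an ATOMNO and None for
    each atom that has none.
    """
    if all(n is None for n in atomnos):
        # The documented default: number the atoms 1...NATOMS in order.
        return list(range(1, len(atomnos) + 1))
    if any(n is None for n in atomnos):
        # Assumed policy: a partial numbering is an error.
        raise ValueError("ATOMNO given for some atoms but not others")
    duplicates = sorted(n for n, count in Counter(atomnos).items() if count > 1)
    if duplicates:
        # Assumed policy: references such as ZMATNOS need unique targets.
        raise ValueError(f"duplicate ATOMNO values: {duplicates}")
    return list(atomnos)
```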
Also, why does the range start at 1 instead of 0? Is that the standard
XML base? My understanding is that XML is an exchange format, so it is
read mostly by computers (or developers) and not end users. Thus, the
"natural" base is 0.
> ELSYM
Why are the element names constrained to the elements you listed?
For example, we have two types of dummy atoms; one for ring center and
one for putative hydrogen acceptor (by extension of the donor and
antecedent). My implementation uses two different names.
Why is there the duplication of things like "PARNOS" and "PARIDS"?
I'm not clear on why both a number and an id field are needed.
I don't understand chirality well enough. The way Daylight does it is to
have the center atom and the bound atoms (perhaps with a special hydrogen)
and some code like "OH25". By my understanding of the PARITY term there
are only three ways to label the atom (+1, 0 and -1). I don't know
how to get 30 chiralities for an octahedron ("OH") out of that.
The PDB has a way to define which ligands (like water) should be
included in the molecular weight calculation. Is that useful here as
well?
I don't see mention of (my personal least favorite) "insertion code".
[Note for bioperl: my newest least favorite is the alternate location
indicator :]
How do you plan to store "real" charges, as compared to formal
charges? This is important for exchanging data for MD and QC
simulations. Same for non-integral mass values.
Have you considered what needs to be done to store multiple
conformations? At present it appears you will need to do it as a new
molecule. Another way would be to have the coordinate data parallel
to the molecular structure. Then trajectories would be represented as
multiple arrays.
Of course, that leads to having the bond information be mutable over
the trajectory as well (for quantum chemistry), and charge information
(for doing free energy calculations) so at some point you need to yell
STOP!
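A rough Python sketch of the "coordinates parallel to the structure"
idea, assuming a fixed topology shared by all frames. The class and
field names are hypothetical, and mutable bonds or charges are
deliberately left out.

```python
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    elements: list                               # one element symbol per atom, shared by all frames
    frames: list = field(default_factory=list)   # each frame is a list of (x, y, z) tuples

    def add_frame(self, coords):
        # Every frame must supply exactly one coordinate triple per atom.
        if len(coords) != len(self.elements):
            raise ValueError(
                f"expected {len(self.elements)} coordinates, got {len(coords)}"
            )
        self.frames.append([tuple(xyz) for xyz in coords])
```

The molecule is described once. Each conformation adds only an array
of coordinates, indexed in the same order as the atoms.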
BONDS
There's a lot of "yet to be ..." so the one comment I can make is that
the conventions I use I learned from Daylight, but I know there are many
others. (E.g., some people have orders like "1.5".)
> CYCLIC The cyclicity of the bond. +1 (acyclic), -1 (cyclic) and
> 0 (unknown, etc).
Isn't this a derived term? I just had to implement SSSR a few weeks
ago, so I'm pretty sure it is. What happens if the CYCLIC term
disagrees with information determined from bonding?
For that matter, how useful is this information? Can you give an
example? I ask because knowing if a bond is in a cycle isn't too
useful if you don't know which cycles the bond is in. In the standard
example of a tetrahedral structure, every bond is in a cycle, but
there are only 3 independent cycles among the 4 faces.
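To show that cyclicity can be derived from the bond list alone, here
is a small Python sketch. It is not an SSSR implementation: it only
tests whether a bond lies on some cycle and counts the independent
cycles (bonds - atoms + connected components). The function names are
illustrative.

```python
def build_adjacency(n_atoms, bonds, skip=None):
    """Neighbour sets for atoms 0..n_atoms-1, optionally ignoring one bond."""
    adjacency = {i: set() for i in range(n_atoms)}
    for bond in bonds:
        if bond == skip:
            continue
        a, b = bond
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def reachable(adjacency, start):
    """All atoms connected to `start`, found by depth-first search."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for neighbour in adjacency[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen


def bond_is_cyclic(n_atoms, bonds, bond):
    # A bond is in a cycle if its ends stay connected once it is removed.
    a, b = bond
    return b in reachable(build_adjacency(n_atoms, bonds, skip=bond), a)


def cycle_rank(n_atoms, bonds):
    # Number of independent cycles: bonds - atoms + connected components.
    adjacency = build_adjacency(n_atoms, bonds)
    unvisited = set(range(n_atoms))
    components = 0
    while unvisited:
        unvisited -= reachable(adjacency, next(iter(unvisited)))
        components += 1
    return len(bonds) - n_atoms + components


# Four atoms, each bonded to the other three.
TETRAHEDRON = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
```

For the tetrahedron every one of the six bonds is cyclic, yet
cycle_rank gives 3, so a per-bond CYCLIC flag says nothing about which
cycles a bond belongs to.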
There are also the questions of what to do with storing force field
information (eg. bond, angle and dihedral force parameters). (I don't
know that field well enough to make more than a few comments on it.)
Other possibilities. We use DGEOM for conformation generation.
DGEOM has terms to say "fix these three atoms relative to each other"
or "keep the distance between these two atoms between 1 and 5 A".
These sorts of conformational restraints are also used in MD, but with
associated energy parameters.
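As an illustration of the second kind of restraint, a Python sketch
that checks a distance window for one conformation. This is not DGEOM's
input format or API; the function name and argument layout are
assumptions, and `math.dist` needs Python 3.8 or later.

```python
import math


def distance_restraint_satisfied(coords, i, j, lower, upper):
    """True if atoms i and j are between `lower` and `upper` apart.

    `coords` is a list of (x, y, z) tuples; distances are in the same
    units as the coordinates (Angstroms in the example above).
    """
    return lower <= math.dist(coords[i], coords[j]) <= upper
```

An MD-style restraint would carry an energy parameter as well, which
this check leaves out.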
FEATURE
There's no example (or definition) of "feature" in the on-line DTD
but there was one at
http://ala.vsms.nottingham.ac.uk/vsms/talks/chemweb/013.html
> <FEATURE DICTNAME="DISULFID" CONVENTION="SWISSPROT" START="31" END="96">
> <XVAR TYPE="ADDRESS">NODE12:31-96</XVAR>
> <XVAR>INTERCHAIN</XVAR>
> </FEATURE>
Is this really a range? Shouldn't it be the two residues involved in
the bond?
Also, you allow "subobject"s and "range"s in the definition of XVAR,
which "is normally the value of a NAME attribute". This appears to
place an implicit restriction on the NAME (e.g., it cannot contain a
":").
The file http://www.venus.co.uk/omf/cml/doc/dtd/htmldoc/xvar.attr.html
under NAME has a link for more information about "name" to
http://www.venus.co.uk/omf/cml/doc/dtd/htmldoc/name.html which fails
for me with "File Not Found". As I recall, the only restriction I've
seen elsewhere is that there be no space character in the name.
Shouldn't these two definitions be the same?
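The implicit restriction shows up as soon as one tries to parse an
address. The Python sketch below assumes a NAME:START-END grammar, which
I inferred from the single "NODE12:31-96" example; the real XVAR syntax
may well differ.

```python
import re

# Assumed grammar: a name without ":" followed by ":START-END".
ADDRESS = re.compile(r"^(?P<name>[^:]+):(?P<start>\d+)-(?P<end>\d+)$")


def parse_address(text):
    """Split an address such as "NODE12:31-96" into (name, start, end)."""
    match = ADDRESS.match(text)
    if match is None:
        raise ValueError(f"not a NAME:START-END address: {text!r}")
    return match["name"], int(match["start"]), int(match["end"])
```

With this grammar a NAME that itself contains ":" cannot be addressed
at all, even though a "no spaces" rule for names would allow it.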
What's the difference between "xaddr" and "address"? I'm not clear on
the distinction from the documentation.
FORMULA
I've asked this before on other things, but since the STOICHIOM can
be a derived property, what is the right thing to do when it disagrees
with the atom information? Also, shouldn't there be some way to say
"This FORMULA corresponds to this molecule/set of atoms"? Otherwise
what do you use when you have multiple molecules? (I suppose you
could use the SMILES notation with a "." separator but you'll have
the difficulty of matching the SMILES component to the actual
compound.)
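A reader could at least detect the disagreement. The Python sketch
below derives element counts from an atom list and compares them with a
stated formula. The function name and the dict representation of the
formula are assumptions, not CML's STOICHIOM syntax.

```python
from collections import Counter


def formula_mismatches(stated, elements):
    """Compare a stated formula with the one derived from the atoms.

    `stated` maps element symbol -> count; `elements` lists one symbol
    per atom. Returns {symbol: (stated_count, derived_count)} for every
    symbol whose counts differ; an empty dict means they agree.
    """
    derived = Counter(elements)
    return {
        symbol: (stated.get(symbol, 0), derived.get(symbol, 0))
        for symbol in set(stated) | set(derived)
        if stated.get(symbol, 0) != derived.get(symbol, 0)
    }
```

What to do once a mismatch is found is still the open question: the
format would have to say which of the two wins.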
= = = =
As you all can see, I have a propensity for long emails :)
My summary to bioperl is, CML has some useful ideas. I worry
that it is not general enough for some of the data I am interested
in viewing, and I believe some of the data is redundant with no
specified behaviour on what to do if differences should arise.
Oh, I should also point out that NCBI's MMDB ASN.1 definition
provides very good coverage of the descriptive information needed in
a structure file, as does the mmCIF work. I believe it would be
useful to use those as the basis for a broader version of CML.
Andrew Dalke
dalke@bioreason.com