Bioperl: 3D biomolecular structure handling for Bioperl
Andrew Dalke
dalke@bioreason.com
Fri, 18 Dec 1998 23:14:02 -0800
Steve Chervitz <sac@neomorphic.com> wrote:
> We'd like to include small molecules, but the focus is really on
> biological macromolecules. Small molecules should be included to the
> extent that they interact with these macromolecules. So our focus here
> is more on structural biology than cheminformatics.
Sounds fine. That means for the first pass you probably don't
have to worry about
| bond order/type, cycle detection and aromaticity
(Some people will want to see that in their structures; eg, I
believe Rasmol does show bond order; but the others shouldn't be
as big a concern.)
> But how best to represent the data in the core format? XML maybe?
By this I take you to mean the representation of the exchange format,
as compared to the API representation. Here's a rough list of
criteria I have for a format, and reasons to want them:
* Extensible
There's no way to define everything for the first release, so
it has to be extensible to support future versions
* Graceful backwards compatibility
It should be possible for later formats to be used by
earlier readers; eg, by making it so new items are ignorable
* NO FIXED FIELD WIDTHS! NO MANDATED MAX LINE WIDTHS!
There goes PDB and mmCIF. 'Nuff said.
* Unicode support
I want to be able to spell people's names correctly (with
accents, tildes, etc.), even in Klingon :)
<http://anubis.dkuug.dk/jtc1/sc2/wg2/docs/n1643/n1643.htm>.
BTW, it's often fun to browse Unicode <www.unicode.org> to
realize just how many different recognizable symbols there are.
* Easy to use in the popular languages (Python, Perl, C++ and
Java seem to account for everyone). This knocks ASN.1 out.
It looks like XML is the best fit to this list. Are there any
other proposals?
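
To make the "new items are ignorable" criterion concrete, here is a minimal sketch in Python using the standard xml.etree.ElementTree module. The element and attribute names (molecule, atom, author, formal_charge, occupancy) are invented for the example; they are not taken from CML or any existing format.

```python
import xml.etree.ElementTree as ET

# A hypothetical "version 2" document. The "version 1" reader below has
# never heard of <formal_charge> or the "occupancy" attribute.
DOC = """<?xml version="1.0" encoding="UTF-8"?>
<molecule name="example">
  <author>José Núñez</author>
  <atom id="1" element="N" x="0.0" y="0.0" z="0.0" occupancy="0.5">
    <formal_charge>1</formal_charge>
  </atom>
  <atom id="2" element="C" x="1.47" y="0.0" z="0.0"/>
</molecule>
"""

def read_v1(text):
    """Read only what a version-1 reader understands and skip the rest."""
    root = ET.fromstring(text.encode("utf-8"))
    atoms = []
    for node in root.findall("atom"):
        # Unknown attributes and child elements are simply never looked at.
        atoms.append((node.get("element"),
                      float(node.get("x")),
                      float(node.get("y")),
                      float(node.get("z"))))
    return root.findtext("author"), atoms

author, atoms = read_v1(DOC)
```

The older reader still gets the author name (accents intact) and the coordinates, and never trips over the newer items.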
CML should be a good place to look (some commentary I sent to
Peter Murray-Rust is coming in my next email).
> Thanks for the pointer. I notice that the VMD Programmer's Guide
> is version 1.1 draft dated 14 May 1996. How "off" is this from the
> current release?
Sadly, quite off. That was before the integration of Tcl into VMD,
which changed a lot of things, and in general there are a lot of
structural changes. It still describes the high level core reasonably
well, but more on the functional description (eg, how events are
managed) and not so much the detailed data organization. I don't find
it too hard to read the code, but then, I wrote a lot of it :) Some of
the details are covered in the "VMDTech" weekly posts I had on the
vmd-l mailing list.
The User Guide is much more up to date. One of the reasons is we
get much more feedback from people using the UG than the PG. Since
we've had perhaps 5 people out of > 10,000 downloads ask questions
about the internals, it's been hard to justify the work needed to
update that document, and a lot easier to just answer the questions
directly.
> These conformational and dynamic issues deserve special treatment.
> For some applications, the structure can be considered an immutable,
> single-conformation entity. Perhaps there could be DynamicStructure
> and StaticStructure objects, which could be inter-convertable.
That's one way to do it. My biggest concern is there seem to be
different types of people:
1) "small molecule" chemists that know the covalent bonding but not
the coordinates (and aren't often concerned with that)
2) quantum chemists that know the positions and compute which
bond should be interpreted as covalent
3) "large molecule" people, who assume covalent bonds never break
4) hybrid people (mixing quantum and classical MD) who mix 2&3
5) xtal and nmr people, who have special conventions like the
alternate location identifier.
6) other; people who do atom-like simulations but aren't really
atoms. (Eg, semiconductor simulations where the defects have
behaviour similar to atoms.)
I've been perhaps overly concerned with trying to deal with all
of these cases in a single system. I have to remember the design
constraint to focus on #3 and #5, which simplifies several things.
(Eg, you don't have to allow cases where the bond topology depends
on the coordinates, as with #2)
Knowing there are simplifications will call for redoing the
implementation in the future to support those cases. OTOH, there's
still a lot you can do, and is needed, within those constraints.
> [cycles in the data structures]
> I don't want to get to the point where memory efficiency in Perl
> is the guiding design principle of the structure object.
Well, the same problems apply in Python so that's not my only
constraint :) I mentioned C++ as being easier, because delete'ing an
object calls the destructor directly (instead of waiting for the ref
count to go to 0). I'm not so sure now. If there are many cycles, you
have to be very careful about how you delete, and the complexity grows
as there are more cycles. Most programs get away with it by either not
allowing editing or being *very* careful.
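
The cycle problem can be shown in a few lines. This is only a sketch: the Atom and Bond classes are hypothetical, it uses the weakref module from present-day Python's standard library as one possible way to break the back-links, and it relies on CPython's reference counting.

```python
import gc
import weakref

class Atom:
    def __init__(self, name):
        self.name = name
        self.bonds = []          # back-links from the atom to its bonds

class Bond:
    def __init__(self, a, b):
        self.atoms = (a, b)      # strong links from the bond to its atoms

def freed_without_collector(weak_backlinks):
    """Build a two-atom molecule, drop it, and report whether reference
    counting alone reclaimed the first atom."""
    gc.disable()                 # keep the cycle collector out of the picture
    try:
        a, b = Atom("N"), Atom("CA")
        bond = Bond(a, b)
        link = weakref.ref(bond) if weak_backlinks else bond
        a.bonds.append(link)
        b.bonds.append(link)
        molecule = {"atoms": [a, b], "bonds": [bond]}   # the owner
        probe = weakref.ref(a)
        del a, b, bond, link, molecule
        return probe() is None
    finally:
        gc.enable()
        gc.collect()             # clean up any cycle left behind
```

With strong back-links the atom and bond keep each other alive after the molecule is dropped; with weak back-links the ref counts reach 0 as soon as the owner goes away.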
I've been considering a broadcast/observer architecture (where's
_Design_Patterns_ when you need the right name?) The idea here is the
basic molecule only contains the atoms and bonds, and a list of
observers (per atom and bond? per molecule?). An "observer" is like
a callback; it is notified when some event occurs (is "broadcast").
In this case, I imagine the observers provide extra information about,
or behaviour to, the molecule.
There are different events that can happen to a molecule: add,
query, modify or remove an atom or bond. It would be useful to be
able to have a callback be notified when these occur. Consider a
"Residue" observer, which keeps track of the atoms in the residue, in
this case, Residue 38. The following is good as an example, but I
think not the best way to implement residue information.
First:
find all atoms that should be in residue 38
store "38" in the Residue object's "resid" field
register as an observer for each atom in the residue
Suppose you are the atom and want to get/set your "resid":
look in the atom object and find "resid" isn't there
go down the list of "query" observers asking for the "resid" data
--> the Residue has the information
<-- and it returns 38 (or a reference to the field for 'set')
return 38 (or the reference) to the caller (and perhaps cache the
observer to speed the next lookup)
(I believe the GoF book calls this a "Chain of Responsibility" and
I've heard something similar called "Acquisition")
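
The "get" half of that lookup might look roughly like this in Python. All the names (PropertyObserver, query, fields) are made up for the sketch, and the 'set' case is left out.

```python
class PropertyObserver:
    """Observer that can answer queries for a single property."""
    def __init__(self, key, value):
        self.key = key
        self.value = value
        self.queries = 0                 # counts lookups, for illustration

    def query(self, atom, key):
        self.queries += 1
        if key == self.key:
            return True, self.value
        return False, None

class Atom:
    def __init__(self, name):
        self.fields = {"name": name}     # data the atom stores itself
        self.observers = []              # the "query" observers, in order
        self._handler = {}               # key -> observer that answered before

    def get(self, key):
        if key in self.fields:
            return self.fields[key]
        # Try the cached observer first, else walk the whole list.
        if key in self._handler:
            chain = [self._handler[key]]
        else:
            chain = self.observers
        for obs in chain:
            found, value = obs.query(self, key)
            if found:
                self._handler[key] = obs
                return value
        raise KeyError(key)

chain_a = PropertyObserver("chain", "A")
residue_38 = PropertyObserver("resid", 38)
atom = Atom("CA")
atom.observers += [chain_a, residue_38]
first = atom.get("resid")    # walks both observers
second = atom.get("resid")   # goes straight to residue_38
```

The second lookup skips the observers that could not answer the first time, which is where the caching pays off.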
Suppose you delete an atom from the molecule:
the atom notifies its observers that it is about to be deleted
--> the Residue removes that atom from its list(s), thus removing
any possible links back to the atom and preventing cycles.
<--
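
A minimal Python sketch of the delete case, with hypothetical class and method names (watch, about_to_delete):

```python
class Atom:
    def __init__(self, name):
        self.name = name
        self.observers = []

class Residue:
    def __init__(self, resid):
        self.resid = resid
        self.atoms = []

    def watch(self, atom):
        self.atoms.append(atom)
        atom.observers.append(self)

    def about_to_delete(self, atom):
        # Drop the link back to the atom so no cycle keeps it alive.
        self.atoms.remove(atom)

class Molecule:
    def __init__(self):
        self.atoms = []

    def add(self, atom):
        self.atoms.append(atom)
        return atom

    def delete(self, atom):
        # Notify the observers before the atom goes away.
        for obs in list(atom.observers):
            obs.about_to_delete(atom)
        atom.observers.clear()
        self.atoms.remove(atom)

mol = Molecule()
res = Residue(38)
n = mol.add(Atom("N"))
ca = mol.add(Atom("CA"))
res.watch(n)
res.watch(ca)
mol.delete(n)
```

After the delete, neither the molecule nor the residue refers to the removed atom, and the atom no longer refers to the residue.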
Suppose you add an atom to the molecule. Here are two possible
scenarios:
1)
add the atom (doesn't know about "resid")
the molecule broadcasts the announcement "here's a new atom"
all the residues "observe" it
set the atom's "resid" field to 38
the atom doesn't know about "resid" so it looks for a callback
that can resolve that property
--> Residue 38 recognizes it
Add the new atom to the list of atoms in the residue
It (somehow) tells the other residues to stop listening
<--
(Hmm, that doesn't work if there's no existing Residue with that
resid so here's probably a better solution)
2)
add the atom (doesn't know about "resid")
the molecule broadcasts the announcement "here's a new atom"
create a single "UnknownResidue" object to observe it
set the atom's "resid" field to 38
using the standard lookup described earlier, get the UnknownResidue
--> See if the Residue already exists with that id
yes? inform Residue to request the proper event notifications
no? create a new Residue observer with info about this atom
Remove the UnknownResidue's observation of the given atom
<--
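
Scenario 2 could be sketched in Python along these lines. The names (handle_set, adopt, the residues dictionary) are assumptions for illustration, and moving an atom from one residue to another is deliberately left out.

```python
class Atom:
    def __init__(self, name):
        self.name = name
        self.observers = []

    def set(self, key, value):
        # The atom knows nothing about "resid"; ask the observers.
        for obs in list(self.observers):
            if obs.handle_set(self, key, value):
                return
        raise KeyError(key)

class Residue:
    def __init__(self, resid):
        self.resid = resid
        self.atoms = []

    def adopt(self, atom):
        self.atoms.append(atom)
        atom.observers.append(self)      # request further notifications

    def handle_set(self, atom, key, value):
        return False                     # re-assigning a residue is not sketched

class UnknownResidue:
    """Catch-all observer for atoms not yet assigned to a residue."""
    def __init__(self, residues):
        self.residues = residues         # shared resid -> Residue table

    def handle_set(self, atom, key, value):
        if key != "resid":
            return False
        residue = self.residues.get(value)
        if residue is None:              # no such Residue yet: create it
            residue = self.residues[value] = Residue(value)
        residue.adopt(atom)
        atom.observers.remove(self)      # stop observing this atom
        return True

class Molecule:
    def __init__(self):
        self.atoms = []
        self.residues = {}
        self.unknown = UnknownResidue(self.residues)

    def add(self, atom):
        self.atoms.append(atom)
        atom.observers.append(self.unknown)   # "here's a new atom"
        return atom

mol = Molecule()
n = mol.add(Atom("N"))
n.set("resid", 38)
ca = mol.add(Atom("CA"))
ca.set("resid", 38)
cb = mol.add(Atom("CB"))
cb.set("resid", 39)
```

Only one UnknownResidue is ever created, and each atom is handed over to a real Residue the first time its "resid" is set.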
If you assume that once an observer can handle a data item, it can
always handle it (until the observer is unhooked), you can cache a lot
of work. The performance penalty over built-in data structures is then
relatively low, since you amortize the acquisition hit over the number
of times it's needed.
I hope that made some sense. It's still thought in progress, but
this type of proposal should be useful for resolving some cyclical
memory problems, and offer a sort of extensibility. This is the first
time I've written it down code-like, and it's given me some ideas.
> Thanks for the usage descriptions and link to your pdblang
> program. Did you manage to embed it inside VMD as you claimed to be
> working on in the comments?
As you look at it, remember it was my first perl script. I got
better at programming! Honest!
Everything except structure editing. Umm, let me check the list.
The following are not needed (either provided by Tcl or HTML pages):
version, info, commands, help, echo, define, undefine, call,
range, print, source, sourcepath, ls, cd, pwd, date, !
The following are available directly:
read, rotate, move, size, coord, select, mmult, rotmatrix, align
(most of these are vector/matrix transforms applied to 3d coordinates
so PDL looks like it's fine, needing perhaps some helper functions
like create the transformation matrix for a 10 degree rotation about a
given axis)
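
As an illustration of the kind of helper meant here, this is a sketch of an axis-angle rotation matrix (Rodrigues' formula). It is written in plain Python with nested lists rather than PDL, and the function names are made up.

```python
import math

def rotation_matrix(axis, degrees):
    """3x3 matrix for a right-handed rotation of `degrees` about `axis`."""
    x, y, z = axis
    norm = math.sqrt(x * x + y * y + z * z)
    if norm == 0.0:
        raise ValueError("rotation axis must be non-zero")
    x, y, z = x / norm, y / norm, z / norm
    c = math.cos(math.radians(degrees))
    s = math.sin(math.radians(degrees))
    t = 1.0 - c
    return [
        [t * x * x + c,     t * x * y - s * z, t * x * z + s * y],
        [t * x * y + s * z, t * y * y + c,     t * y * z - s * x],
        [t * x * z - s * y, t * y * z + s * x, t * z * z + c],
    ]

def apply(matrix, point):
    """Multiply a 3x3 matrix by a 3-vector."""
    return tuple(sum(m * p for m, p in zip(row, point)) for row in matrix)

# Rotate a point on the x axis by 10 degrees about the z axis.
rotated = apply(rotation_matrix((0.0, 0.0, 1.0), 10.0), (1.0, 0.0, 0.0))
```

The same matrix can be applied to every atom position to rotate a whole structure.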
The following can be easily constructed from the available primitives:
copy, box, com, average
(the last three are simple in perl if you have access to the coordinates)
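
For instance, assuming "box" means the axis-aligned bounding box, "com" the center of mass and "average" the unweighted mean position (my reading of the command names, not checked against pdblang), they might be written in Python as:

```python
def bounding_box(coords):
    """Return (min corner, max corner) of a list of (x, y, z) tuples."""
    xs, ys, zs = zip(*coords)
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))

def average(coords):
    """Unweighted mean position."""
    n = len(coords)
    return tuple(sum(axis) / n for axis in zip(*coords))

def center_of_mass(coords, masses):
    """Mass-weighted mean position."""
    total = sum(masses)
    return tuple(sum(m * v for m, v in zip(masses, axis)) / total
                 for axis in zip(*coords))

coords = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (2.0, 4.0, 0.0)]
masses = [1.0, 1.0, 2.0]
box = bounding_box(coords)
com = center_of_mass(coords, masses)
avg = average(coords)
```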
Somewhat-equivalents, but with different semantics:
merge, select, delete (by creating a temporary PDB file and
loading the results)
(here we start getting into molecular editing)
Not available in VMD (without in essence writing a Tcl program):
renum, offset, rename, charmm22, replace
The "renum" and "offset" refer to the atom index, which VMD ignores.
The "rename" command lets you change, eg, the atom name. VMD doesn't
allow that since changing the backbone "C" to an "S" calls for a lot
of recalculation, and probably updating the display with the new data.
Andrew
dalke@bioreason.com