[DAS] DAS for protein structures

Andrew Dalke dalke at dalkescientific.com
Sun Jul 25 16:07:10 EDT 2004


Andreas:
> Here the idea is to reduce the PDB file to the minimal data needed
> for visualization, i.e. coordinates of atoms and their connections.
> The biological data that is projected onto the 3D structure by a
> client is retrieved via DAS - Feature and Alignment services.

What is "the minimal data needed for visualization"?  The most terse
file format I know is the XYZ format, which has X, Y, Z coordinates
and element type.  Everything else about the structure can be
derived from that either through quantum mechanics or through
empirical methods.
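
To show how little the XYZ format carries, here is a minimal Python
sketch of a reader.  It assumes the common layout (atom count on the
first line, a free-form comment on the second, then one
"element x y z" record per atom).  The function name parse_xyz and the
water coordinates are illustrative only.

```python
def parse_xyz(text):
    """Parse one XYZ frame into a list of (element, x, y, z) tuples."""
    lines = text.strip("\n").splitlines()
    natoms = int(lines[0].split()[0])      # line 1: number of atoms
    # line 2 is a free-form comment and is ignored here
    atoms = []
    for line in lines[2:2 + natoms]:
        element, x, y, z = line.split()[:4]
        atoms.append((element, float(x), float(y), float(z)))
    if len(atoms) != natoms:
        raise ValueError("expected %d atoms, found %d" % (natoms, len(atoms)))
    return atoms


# Illustrative input: approximate water geometry, in Angstroms.
WATER = """3
water
O 0.000 0.000 0.117
H 0.000 0.757 -0.469
H 0.000 -0.757 -0.469
"""

atoms = parse_xyz(WATER)
```

Nothing in the result says which residue or chain an atom belongs to,
which is exactly the information people ask for next.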

Humans want more than that, like residue name, chain id, and segment
name (I don't think your spec had the last).  Some people want
to see how the structure fits in the crystal, e.g., to see whether a
given feature is more an aspect of crystal packing forces.  Some want
the secondary structure annotation information (HELIX and SHEET)
while others are just fine with automated means.

By saying you're only going to support a subset of what's in the
PDB you're saying that those other portions aren't important enough.
But they are, or could be for some people and some structures.


>> After all, no structure program is going to
>> handle your XML format.
>
> I guess no structure program is capable of doing ANY - DAS  
> communication at
> the moment.  That's what we try to provide - missing services to apply  
> DAS in
> the structure world. If you are developing a Java program  (I know you  
> are a
> Python guy, but still ;-)  , making it DAS enabled  is quite simple.  
> There is
> support for the new  DAS commands in Biojava. e.g.:

But it's a lot easier to get an existing Java structure visualization
library to support a PDB file than to support your new format, or
your biojava structure object.  For example, suppose I want to use
Jmol or Marvin as my viewer -- how hard would that be using your API?

I see the Biojava structure object supports reading the PDB format
but it doesn't capture all of the data so going through it to
read the DAS result then generate a PDB formatted string to pass
to another library will cause some data loss.

There are many sources of data loss.  For example, I see you
support the x-ray resolution field, but it turns out that the
documentation isn't correct.  It isn't a simple float because
a resolution of "1.20" is different from one of "1.2".  There are
a few other places like that.  And you don't support PDB version
1 files, nor extensions like XPLOR's serial numbering extension,
where the first digit can roll over to A (as in 99999, A0000, ...)
to support more than 99999 atoms.
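
Both kinds of loss can be sketched in a few lines of Python.  This is
an illustration under assumptions, not a description of any existing
parser: parse_resolution and decode_serial are hypothetical names, and
the serial decoding only follows the "99999, A0000, ..." rollover as
described above (A counts as 10, B as 11, and so on).

```python
from decimal import Decimal


def parse_resolution(field):
    """Keep the digits as written; a float would drop the trailing zero."""
    return Decimal(field.strip())


def decode_serial(field):
    """Decode an atom serial whose first character may roll over to a letter.

    Assumed scheme: '99999' is followed by 'A0000' (= 100000), then
    'A0001', ..., with 'B0000' = 110000 and so on.
    """
    field = field.strip()
    first = field[0]
    if first.isdigit():
        return int(field)
    return (ord(first.upper()) - ord("A") + 10) * 10000 + int(field[1:])


as_float = str(float("1.20"))               # trailing zero is lost
as_decimal = str(parse_resolution("1.20"))  # trailing zero is kept
```

A Decimal (or simply the original string) round-trips the field as it
was written, so a regenerated PDB record can match the source.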

> To get a Biojava structure object via DAS
>
> String server =  
> "http://das.sanger.ac.uk/das/structure/structure?query=";
> DASStructureClient dasc = new DASStructureClient(server);
> Structure struc = dasc.getStructure(pdbcode);	

Suppose you instead returned

<structure type="chemical/x-pdb">
HEADER    IMMUNOGLOBULIN                          16-JAN-92   XXXX
TITLE     2.9 ANGSTROMS RESOLUTION STRUCTURE OF AN ANTI-DINITROPHENYL-
TITLE    2 SPIN-LABEL MONOCLONAL ANTIBODY FAB FRAGMENT WITH BOUND
TITLE    3 HAPTEN
  ...
ATOM
  ...
END
</structure>

The API wouldn't change at all.  The implementation would, but
not the API.

Or suppose you instead used a more ReST-ful format which returns

<structure href="some/other/url" />

Then that href lookup could be cached, or translated into a local
fetch, or pointed to RCSB's PDB server.  It could also support
things like content negotiation to return a PDB vs. CML vs. other
file format, at the desire of the client.  (Though con-neg is
still more a hope of mine than something actually used.)
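
A client-side sketch of that content negotiation, using only the
Python standard library.  The URL is a placeholder, the helper name
build_structure_request is hypothetical, and the MIME types
(chemical/x-pdb, chemical/x-cml) are assumed to be what a server would
recognize.  No request is actually sent here.

```python
import urllib.request


def build_structure_request(url, preferred=("chemical/x-pdb", "chemical/x-cml")):
    """Build a GET request whose Accept header lists formats by preference."""
    accept = ", ".join(
        "%s;q=%.1f" % (mime, 1.0 - 0.1 * i) for i, mime in enumerate(preferred)
    )
    return urllib.request.Request(url, headers={"Accept": accept})


# Placeholder URL standing in for the href found in the <structure> element.
req = build_structure_request("http://example.org/some/other/url")
accept_header = req.get_header("Accept")
```

Whether the server honors the header is a separate question; a client
should still check the Content-Type of the response.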

In any case, the API would be identical to what you propose.
The format is just that, a format.  There must be something
to convert it to a Biojava API whether that format be this
new XML one, PDB or mmCIF.  Your API hides the conversion layer,
so it's invisible to the application code no matter the format.
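
The point can be sketched in Python.  Everything here is hypothetical
(the Structure class, the PARSERS registry, the fetch callables, and
the toy parsers, which only count atoms); it is not the Biojava API.
The application calls get_structure the same way regardless of which
format the server sends.

```python
from dataclasses import dataclass


@dataclass
class Structure:
    atom_count: int


PARSERS = {}  # MIME type -> function turning the response body into a Structure


def register(mime_type):
    def decorator(func):
        PARSERS[mime_type] = func
        return func
    return decorator


@register("chemical/x-pdb")
def parse_pdb(body):
    # Toy parser: count ATOM/HETATM records only.
    records = [l for l in body.splitlines() if l.startswith(("ATOM", "HETATM"))]
    return Structure(atom_count=len(records))


@register("chemical/x-xyz")
def parse_xyz_body(body):
    # Toy parser: the first line of an XYZ file is the atom count.
    return Structure(atom_count=int(body.splitlines()[0]))


def get_structure(code, fetch):
    """fetch(code) must return (mime_type, body); the format stays hidden."""
    mime_type, body = fetch(code)
    if mime_type not in PARSERS:
        raise ValueError("no parser registered for %r" % mime_type)
    return PARSERS[mime_type](body)


# Two stand-in servers returning the same structure in different formats.
def fetch_pdb(code):
    return "chemical/x-pdb", "HEADER\nATOM  first\nATOM  second\nEND\n"


def fetch_xyz(code):
    return "chemical/x-xyz", "2\ncomment\nH 0 0 0\nH 0 0 0.74\n"


same = get_structure("XXXX", fetch_pdb) == get_structure("XXXX", fetch_xyz)
```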

>> You use the "cigar" string because it provides an "efficient way to
>> encode an alignment" but then you don't provide an efficient way to
>> encode the rotation matrix.
>
> Yes, but the matrix does not take much space, so it is not really an  
> issue. An
> alignment in contrast can be quite big, so the cigar encoding saves a  
> lot of
> space.

Then don't even worry about it as a space issue.  Just give the
4x4 homogeneous transformation matrix.  Anyone doing structure work
should have libraries for handling coordinate transforms like this,
and it's much more elegant than having several different element
types (for both the matrix and vector).

I'll still argue that you should use a format like
<geo3d>m11,m12,m13,m14,m21,m22,m23,m24,m31,m32,m33,m34,m41,m42,m43,m44</geo3d>

rather than

         <geo3D intObjectId="intObjectId">
                 <vector x="xCoord" y="yCoord" z="zCoord"/>
                 <matrix>
                         <max11 coord="float"/>
                         <max12 coord="float"/>
                         <max13 coord="float"/>
                         <max21 coord="float"/>
                         <max22 coord="float"/>
                         <max23 coord="float"/>
                         <max31 coord="float"/>
                         <max32 coord="float"/>
                         <max33 coord="float"/>
                 </matrix>
         </geo3D>

It's just so much easier for implementers to read a single
vector of numbers into a 4x4 matrix than to read your format.
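
For instance, a reader for the comma-separated form takes only a few
lines of Python.  This is a sketch under assumptions: the values are
taken to be in row-major order (m11..m14 is the first row) with the
translation in the fourth column, and parse_matrix4 and transform are
made-up names.

```python
def parse_matrix4(text):
    """Read 16 comma-separated numbers into a 4x4 row-major matrix."""
    values = [float(v) for v in text.split(",")]
    if len(values) != 16:
        raise ValueError("expected 16 values, got %d" % len(values))
    return [values[i:i + 4] for i in range(0, 16, 4)]


def transform(matrix, point):
    """Apply a 4x4 homogeneous transform to an (x, y, z) point."""
    x, y, z = point
    vec = (x, y, z, 1.0)
    out = [sum(m * v for m, v in zip(row, vec)) for row in matrix]
    w = out[3]
    return tuple(c / w for c in out[:3])


# Identity rotation plus a translation of (5, -2, 3).
matrix = parse_matrix4("1,0,0,5,0,1,0,-2,0,0,1,3,0,0,0,1")
moved = transform(matrix, (1.0, 1.0, 1.0))
```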

What is your criterion for weighing space savings against
implementation cost?  Why wouldn't

<geo3D intObjectId="intObjectId" x="xCoord" y="yCoord" z="zCoord"
   r11="float" r12="float" r13="float" ... r33="float" />

be even more concise and readable?

Another option is to consider how the SVG spec handles the
same problem, though it is in 2D instead of 3D.  Here are
a few examples I found:

<g transform="translate(-10,-20) scale(2) rotate(45) translate(5,10)">

<g transform="translate(-10,-20)">
   <g transform="scale(2)">
     <g transform="rotate(45)">
       <g transform="translate(5,10)">

<g transform="matrix(1 0 0 1 10 -3)">

The last is the closest to what I'm proposing.  (The earlier
ones are harder because the rotation can be around different axes.)
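
To make the comparison concrete, here is a small Python sketch of the
SVG convention, where matrix(a b c d e f) stands for the 3x3 matrix
[[a c e], [b d f], [0 0 1]].  The helper names are made up for this
example.  It shows that a chain of transforms collapses into one
matrix, which is why the single-matrix form is enough.

```python
def svg_matrix(a, b, c, d, e, f):
    """SVG matrix(a b c d e f) as a 3x3 homogeneous matrix."""
    return [[a, c, e], [b, d, f], [0.0, 0.0, 1.0]]


def translate(tx, ty):
    return svg_matrix(1, 0, 0, 1, tx, ty)


def scale(s):
    return svg_matrix(s, 0, 0, s, 0, 0)


def matmul3(p, q):
    """3x3 matrix product p * q."""
    return [[sum(p[i][k] * q[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def apply(m, x, y):
    """Transform the 2D point (x, y) by the 3x3 matrix m."""
    return (m[0][0] * x + m[0][1] * y + m[0][2],
            m[1][0] * x + m[1][1] * y + m[1][2])


# "translate(-10,-20) scale(2)": the rightmost transform acts on the point first.
combined = matmul3(translate(-10, -20), scale(2))
p1 = apply(combined, 1, 1)
p2 = apply(svg_matrix(1, 0, 0, 1, 10, -3), 0, 0)
```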

That suggests an even nicer encoding as

<geo3D intObjectId="intObjectId"
     matrix="r11 r12 r13 r21 r22 r23 r31 r32 r33 t1 t2 t3" />

(or use the full 4x4 matrix).  Terse, concise, easy to support.
What's not to like about it?
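
As a sketch of how little code that takes, the following Python reads
the attribute with the standard library's ElementTree and expands it
into a 4x4 homogeneous matrix.  The attribute order (nine rotation
terms, row by row, then t1 t2 t3) follows the example above; the
function name and the sample values are made up.

```python
import xml.etree.ElementTree as ET


def geo3d_to_matrix4(xml_text):
    """Expand matrix="r11 ... r33 t1 t2 t3" into a 4x4 row-major matrix."""
    elem = ET.fromstring(xml_text)
    attr = elem.get("matrix")
    if attr is None:
        raise ValueError("missing 'matrix' attribute")
    v = [float(s) for s in attr.split()]
    if len(v) != 12:
        raise ValueError("expected 12 values, got %d" % len(v))
    r, t = v[:9], v[9:]
    return [
        r[0:3] + [t[0]],
        r[3:6] + [t[1]],
        r[6:9] + [t[2]],
        [0.0, 0.0, 0.0, 1.0],
    ]


# Sample: 90 degree rotation about z, then a shift of (10, 20, 30).
SAMPLE = '<geo3D intObjectId="example" matrix="0 -1 0 1 0 0 0 0 1 10 20 30" />'
m = geo3d_to_matrix4(SAMPLE)
```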


>> Why is CRC64 suggested?  (md5 is better.)
>
> This is the checksum provided by Swissprot.

But why is it suggested?  Why not just leave it as

 > attribute:objectVersion
 >
 >  (required) version of Object

and don't make any recommendation for how to construct the
checksum.  Better would be to make some functional description
of the version, like "must change when the sequence changes"
for the weak version you have, or "must be a positive integer
which increments when the sequence changes" for a strict version.
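
Both descriptions can be met in several ways.  As a Python sketch
only: a digest of the sequence (md5 here, via hashlib) gives a
fixed-length value that changes with the sequence, barring
collisions, while a counter gives the strict form.  The names
weak_version and StrictVersioner are hypothetical and not part of any
spec.

```python
import hashlib


def weak_version(sequence):
    """Fixed-length version string that changes when the sequence changes."""
    return hashlib.md5(sequence.encode("ascii")).hexdigest()


class StrictVersioner:
    """Positive integer that increments each time the sequence changes."""

    def __init__(self):
        self._last = None
        self.version = 0

    def update(self, sequence):
        if sequence != self._last:
            self.version += 1
            self._last = sequence
        return self.version


# Made-up sequences for illustration.
v = StrictVersioner()
history = [v.update("MKTAYIAK"), v.update("MKTAYIAK"), v.update("MKTAYIAR")]
```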

BTW, as written the objectVersion can be identical to the
protein sequence itself.  Is there a limit to the size of
the version string?

The SWISS-PROT record also keeps the timestamp for the
last change of the protein sequence.  What about using that
field instead?  Not that I want to mandate that one, but I
offer it as another value which meets your spec, and seems
more appropriate.

Do you know about the Chemistry Development Kit
(http://sourceforge.net/projects/cdk/ ) or Joelib
(http://www-ra.informatik.uni-tuebingen.de/software/joelib/index.html )?
They are two other open-source chemistry libraries for Java and may
contain code or techniques you all can draw from.

					Andrew
					dalke at dalkescientific.com


