[DAS] DAS for protein structures

Fri Jul 23 07:20:44 EDT 2004

Hi Andreas,

Some comments on the proposal

> To read more details please access the specification at
> http://www.sanger.ac.uk/xml/das/documentation/

 > The SEQRES protein sequences, which is contained in a  PDB file, can
 > be different to some extent.

Might want to link to the PDB docs for the SEQRES records.  You
can also find heterogens (non-ATCG) in the sequence, and it
mentions one of my favorite words in the docs - microheterogeneity.

 >  There can be  negative positions, the order of the numbers
 > does not need to be  linear, there are alternative locations
 > possible (indicated by "A",  "B"),

In this case I suspect the "A" and "B" are insertion codes and
not alternate locations.  The latter is used when it appears
an atom can be in one of multiple positions, as I recall.

 > All orientation arguments that are used in various services
 > are becoming optional, since orientation is related to the
 > orientation along the DNA and is not needed for proteins.

Isn't it still required for nucleotides and ignored for protein?
Otherwise as you state it the orientation parameter is also
optional for DNA.  Is "orientation=+" or "orientation=" equivalent
to an unspecified orientation parameter when the sequence is
a protein?

 > depreciated

"deprecated"

 > "is re-established again."

"is re-established."  Unless this is the second time it's been
re-established?

 > "The ref is argument has "

"The ref argument has "

 > It has a version  number (required) in the form "N.NN"

Define "N.NN".  Does this mean there can be only 1000 versions?
Why the limit?  Why not \d+\.\d+ or \d+(\.\d+)?  ?  Should there
be a meaning to the two parts of the version?  Should it always
be an increasing value?  Isn't the version information captured
elsewhere?
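
A minimal sketch of how the readings differ.  Taking "N.NN" to mean
exactly one digit, a dot, then two digits is only a guess at what the
spec intends, and the names below are made up for illustration.

```python
import re

# Assumption: "N.NN" means one digit, a dot, then exactly two digits.
STRICT_N_NN = re.compile(r"\d\.\d\d")
# The looser alternatives suggested above.
MAJOR_MINOR = re.compile(r"\d+\.\d+")
MAJOR_OPTIONAL_MINOR = re.compile(r"\d+(\.\d+)?")

PATTERNS = {
    "N.NN": STRICT_N_NN,
    r"\d+\.\d+": MAJOR_MINOR,
    r"\d+(\.\d+)?": MAJOR_OPTIONAL_MINOR,
}


def accepted_by(version):
    """Return the names of the patterns that match the whole string."""
    return [name for name, pattern in PATTERNS.items()
            if pattern.fullmatch(version)]


for candidate in ("1.00", "9.99", "10.0", "12"):
    print(candidate, accepted_by(candidate))
```

Under the strict reading "10.0" is rejected, which is where the limit of
1000 distinct versions comes from.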

 > Whenever the DNA of the entry point changes, the version
 > number should change as well.

"Should"?  Or "must"?

The entry_points optional attribute "href"
 > echoes the URL query that was used to fetch  the current document.

I don't understand the need for this.  If it's important, it won't
work in some environments because the client's request might be
   http://some.host/x/y/z

where the machine "some.host" forwards the request to another machine as
   http://another.host/prefix/x/y/z

which does the actual work.  The machine "another.host" is on
its own local DNS which isn't visible to the outside world.  Since
the internal machine doesn't know the original URL used by the
client it can't pass back a valid URL.

 >  For compatibility with older versions of the specification, the
 > <SEGMENT>  tag can use a size attribute rather than start and stop,
 > and  can omit the orientation attribute

Can "size" be used in addition to start/stop as a transition from
the older version to the newer one?  If omitted, is the orientation
equal to "+"?

 > This query returns one or all alginments

"alignments"

Under the <dasalignment> XML you have
 > (required; one only)  >The doctype indicates which formal DTD
 > specification to use.  For the dna query, the doctype DTD is
 > "http://www.biodas.org/dtd/dasdna.dtd".

Is that a bad copy&paste from the previous spec?

 > subject (optional; one or more) the id of the alignment - subject.
 > To get a list of available alignments for query use the entry_points
 > request.

If there is more than one subject, how is the parameter constructed?
Is it comma separated?

 >  (required) version of Object. e.g. CRC64 checksum for protein
 > sequences.

Why is this version not in the form N.NN?

Why is CRC64 suggested?  (md5 is better.)  Why only for protein
sequences?
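
For what it's worth, an md5-based objectVersion is only a few lines with
the Python standard library, and works the same for DNA as for protein.
This is a sketch: the function name and the normalization step
(stripping whitespace, upper-casing) are assumptions, not something the
proposal defines.

```python
import hashlib


def sequence_md5(sequence):
    """Return the MD5 hex digest of a sequence string.

    Assumption: whitespace and letter case are not significant, so they
    are normalized away before hashing.
    """
    normalized = "".join(sequence.split()).upper()
    return hashlib.md5(normalized.encode("ascii")).hexdigest()


print(sequence_md5("MKTAYIAKQR"))
print(sequence_md5("acgtacgt"))
```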

 > attribute:intObjectId
 >
 > (required) internal, unique name name for this object.  This is used
 > in the SEGMENT section to identify to which object an alignment
 > belongs to.

The prefix "int" is confusing.  Even "internal" is confusing -- internal
to what?  What about "sequenceId" since all the objects are sequences?

attribute:type

 > (optional) a type for this object.e.g. DNA, PROTEIN, STRUCTURE, etc.

Who defines "etc."?  What about "RNA"?  "ssRNA"?  "tRNA"?  Is the
case important?

The example you give includes

<alignment>
   <alignObject dbAccessionId="someid" objectVersion="version"
                intObjectId="internalId" type="objectType"
                dbSource="someSouce"
                dbVersion="version" dbCoordSys="coords"  >
     <alignObjectDetail dbSource="someSouce" property="property">

Please move "dbAccessionId" to be with the attributeGroup:dbRef
terms, to make it easier to compare the outline with the documentation.

Could you give snippets from a real example?

 >        <score methoName="scorename" value="scorevalue">

"methodName"

 > attribute:dbCoordSys
 > (optional). The co-ordinate system used by the database. This
 > is not always the same as the database. For example, Pfam uses
 > UniProt ...

How is this specified?

 > Clients generally should use the DAS - SEQUENCE request to get the
 > seqeuence, so this is optional

If it's optional then why have it here?  As defined, all clients must
understand how to get to the DAS - SEQUENCE since they cannot assume the
server supports returning the sequence here.  And btw, it's "sequence"
not "seqeuence".

 > attribute:property

What are the defined property values for an alignObjectDetail?  Also,
fix up the formatting for this example.  Also, "CDATA" refers to unescaped
character content while I think you mean "element content".

 > attribute:methodName
 > (required) the name of the score, e.g. number of equivlanet
 > residues (eqr), e-value, etc.

What about "scoreType"?  Do you have an enumerated list?  Are all of the
values expected to be a number?  If so, is there a restriction to the
range of the number?  Are IEEE754 exceptional values, like NaN or Inf
allowed?
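
The question matters in practice because a client that parses the value
with its language's ordinary float conversion will usually accept the
exceptional values without complaint.  A small Python sketch (the
function name is made up for illustration):

```python
import math


def parse_score(text):
    """Parse a score value and report whether it is a finite number."""
    value = float(text)
    return value, math.isfinite(value)


for raw in ("42", "1e-30", "NaN", "Inf", "-inf"):
    value, finite = parse_score(raw)
    print(raw, value, "finite" if finite else "exceptional")
```

If the spec means to forbid NaN and Inf, it has to say so, since a
parser like this one will not reject them on its own.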

 > Element:<geo3D>

You use the "cigar" string because it provides an "efficient way to
encode an alignment" but then you don't provide an efficient way to
encode the rotation matrix.  Two possibilities are:
   - it's orthonormal so only include the upper/lower triangle
   - use comma separated values

You don't say if the vector transformation occurs before or after the
rotation matrix.  Nor do you say which structure gets the transformation,
since it only states:
     this  section defines how one of the needs to be shifted and rotated
     in order to be superimposed with the others.
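
The order is not a detail.  Here is a sketch in plain Python, using an
arbitrary example rotation (90 degrees about z) and an arbitrary shift;
none of these values come from the proposal.

```python
def rotate(matrix, point):
    """Apply a 3x3 rotation matrix (tuple of rows) to a 3D point."""
    return tuple(sum(m * p for m, p in zip(row, point)) for row in matrix)


def translate(vector, point):
    """Shift a 3D point by a vector."""
    return tuple(p + v for p, v in zip(point, vector))


# Example values only: rotate 90 degrees about the z axis, shift along x.
ROT_Z_90 = ((0.0, -1.0, 0.0),
            (1.0, 0.0, 0.0),
            (0.0, 0.0, 1.0))
SHIFT = (1.0, 0.0, 0.0)
atom = (1.0, 0.0, 0.0)

rotate_then_shift = translate(SHIFT, rotate(ROT_Z_90, atom))
shift_then_rotate = rotate(ROT_Z_90, translate(SHIFT, atom))

print(rotate_then_shift)  # (1.0, 1.0, 0.0)
print(shift_then_rotate)  # (0.0, 2.0, 0.0)
```

The same matrix and vector put the atom in two different places, so two
clients that guess differently will disagree on the superposition.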

Couldn't you just write this as a (perhaps flattened) homogeneous
transformation matrix simplified because you know it's only going
to be used for rigid body transformations?

The result would look like:
   <geo3D intObjectId="xxx">r11,r12,r13,r22,r23,r33,t1,t2,t3</geo3D>
and be much more succinct than what you have now.

Under "Retrieve 3D coordinates".

If the chain is not given is it assumed to be equivalent to
the chain " "?  All PDB residues have a chain, and space is allowed
for a chain id.  Or does unspecified chain mean get the first chain?

Since "one or more" chain ids are allowed, how are they given?  Comma
separated values?

Where do I find the number of models in the structure?  The docs imply
it can be found from entry_points ("The same applies to a  structure
server where entry_points returns the list of  available chains and
models.")  I don't see that field described.

How do you support the alternate location identifier?  Just ignore it?
Return all locations for a given atom?

Why do you define your own XML format for 3D structure?  What about
basing it on, say, CML?  Or why not just feed a PDB file back, perhaps
embedded inside of XML?  After all, no structure program is going to
handle your XML format.

If you do want to roll your own, there are many things to fix.  Here
are several:

 > attribute:groupID
 >
 > (required) the PDB code of the amino acid. e.g. 25,26,27A
 >
 > attribute:insertCode
 >
 > (optional) insertion code for amino acid. e.g 86A, 86B

Okay, which is the group ID and which is the insertion code?  First
should be a number (-2, 0, 26) and the insertion code is a
character.
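
In other words, something like "27A" should be split into a signed
integer and an optional one-letter code.  A sketch of that split,
assuming the insertion code is at most a single letter (the helper name
is made up for illustration):

```python
import re

# Signed residue number followed by an optional one-letter insertion code.
RESIDUE_ID = re.compile(r"(-?\d+)([A-Za-z]?)")


def split_residue_id(text):
    """Split e.g. '27A' into (27, 'A'); no insertion code gives ''."""
    match = RESIDUE_ID.fullmatch(text.strip())
    if match is None:
        raise ValueError("not a residue id: %r" % (text,))
    number, insertion_code = match.groups()
    return int(number), insertion_code


for raw in ("25", "27A", "86B", "-2", "0"):
    print(raw, split_residue_id(raw))
```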

 >         <connect type="connectionType">
 >                <atomid atomID="atomID"/>
 >        </connect>

Two atoms make a connection.  Where's the other atomID?  Also, in
some places you have "Id" (as "dbAccessionId") and in others
you have "ID".

Are only covalent bonds important?  What about HYDBND records?

You also ignore the anisotropic B-factors and other bits of data
which may be in the PDB file.  For example, waters on the symmetry
axis of a crystal structure may be denoted by an occupancy value
of 1/symmetry count.  (See the comments for 2PLV.)

And you're missing the crystal information.

It's 5am here so my apologies if any of the above sounds overly
terse or confusing.

					Andrew
					dalke at dalkescientific.com