[DAS2] structure DAS

Fri Jun 3 18:50:47 UTC 2005

I've been thinking more about the general idea of a structure DAS.

I think it would be good to involve someone with more recent
(and better) structure knowledge than I have.  This may be
the woman from RCSB mentioned yesterday.  Another idea is
Steven Brenner.

There are two main ways to think about proteins: sequence
and conformation.

The sequence model is similar to that used for DNA.
Sequences have residues arranged in a line, with positions
numbered consecutively from 1.

The biggest database for this is SWISS-PROT.  Here's
an example of its feature annotations:

FT   DOMAIN      583    920       HECT.
FT   REGION      515    571       PABP-like.
FT   COMPBIAS    108    119       Asp/Glu-rich (acidic).
FT   COMPBIAS    158    181       Pro-rich.
FT   COMPBIAS    451    470       Arg/Glu-rich (mixed charge).
FT   COMPBIAS    479    488       Arg/Asp-rich (mixed charge).
FT   COMPBIAS    610    621       Asp/Glu-rich (acidic).
FT   COMPBIAS    858    878       Pro-rich.
FT   ACT_SITE    889    889       Glycyl thioester intermediate (By
FT                                similarity).

Each entry gives a feature type, start and end positions, and a
description.
I imagine there is an ontology for these but I haven't been
following that work.
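
A minimal Python sketch of reading feature lines like the ones above
into (type, start, end, description) records.  It assumes the
old-style "FT" layout shown here, plain integer positions, and that
a continuation line leaves the feature-type column blank.  The names
Feature and parse_ft_lines are made up for illustration and are not
part of any DAS or SWISS-PROT library.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    key: str          # feature type, e.g. "DOMAIN"
    start: int        # first residue position
    end: int          # last residue position (inclusive)
    description: str


def parse_ft_lines(lines):
    """Parse old-style SWISS-PROT 'FT' lines into Feature records."""
    features = []
    for line in lines:
        if not line.startswith("FT") or not line[2:].strip():
            continue  # not a feature line, or an empty one
        if len(line) > 5 and line[5] != " ":
            # A new feature: key, start, end, then free-text description.
            fields = line.split(None, 4)
            _, key, start, end = fields[:4]
            description = fields[4].strip() if len(fields) > 4 else ""
            features.append(Feature(key, int(start), int(end), description))
        elif features:
            # Continuation line: append its text to the previous feature.
            prev = features[-1]
            joined = (prev.description + " " + line[2:].strip()).strip()
            features[-1] = prev._replace(description=joined)
    return features


example = """\
FT   DOMAIN      583    920       HECT.
FT   ACT_SITE    889    889       Glycyl thioester intermediate (By
FT                                similarity).
""".splitlines()

for feature in parse_ft_lines(example):
    print(feature)
```

Splitting on whitespace rather than on fixed columns keeps the sketch
short, at the cost of not handling uncertain positions such as "<1"
or "?".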

Structure is more complicated.  The biggest data source
for this is the PDB.  Things to worry about:

  * a PDB record may contain aggregates of protein, DNA, lipids,
waters, ions, ligands, post-translational modifications and
other bits and pieces.

  * the sequence listed for a chain may differ from the residues
actually observed in the crystal structure.

  * residue numbers in the structure may not be consecutive.  Eg,
in a chain the residue ids may be -2, -1, 1, 2A, 2B, 2C.  The
numbering is often done to preserve residue identifiers across
homologous structures.

  * some features are at the atomic level and not the residue level.
For that matter, some people like things like "center of ring"
but I think we can ignore those.  Others like "binding pocket"
but there's no good way to specify that.

  * some residues have "alternate" conformations, eg, a side
chain that's believed to have two common orientations.  I
don't think we need to worry about this.

  * NMR structures (and others) may have multiple models.
I think we don't need to worry about this.  All programs I
know of handle these as alternate conformations and have
no way to say a given feature is on only one of those
conformations.

  * some features may be over several regions of a protein,
or across several different chains.  Eg, a disulphide bond
between two different proteins or an indicator of a beta
barrel composed of multiple proteins.

  * strange things, like a protein covalently bonded to a
piece of DNA.  Those chemists are so whacky!  Here's a
picture of one done in my old group
   http://www.ks.uiuc.edu/Research/pro_DNA/hmgd/SDNA_t.gif
from
   http://www.ks.uiuc.edu/Research/pro_DNA/hmgd/
I think it's okay to linearize these.

  * crystal structures and symmetries.  One example that
comes to mind is the virus structure I worked on where
a beta sheet went from one protein chain on the given
protomer to another protein chain on the next protomer
around the 5-fold symmetry axis.  But the structure
record only contains a single protomer.  I don't think
we need to worry about this because to the best of
my knowledge that information is not available in
any database; it's extracted by humans reading the
comments and associated papers.
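
To make two of the points above concrete (non-consecutive residue
ids, and features that span several regions or chains), here is a
rough Python sketch.  It assumes a residue id is an integer with an
optional one-letter insertion code, and that "linearizing" means
using the order in which residues appear in the record.  ResidueId,
parse_residue_id, linear_index and the feature dictionary are all
hypothetical names for illustration, not a proposed DAS format.

```python
import re
from typing import NamedTuple

# An optionally negative integer followed by an optional insertion code.
_RESID = re.compile(r"^(-?\d+)([A-Za-z]?)$")


class ResidueId(NamedTuple):
    chain: str
    number: int
    icode: str  # insertion code, "" if there is none


def parse_residue_id(chain, text):
    """Turn a string such as '-2' or '2B' into a ResidueId."""
    match = _RESID.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized residue id: {text!r}")
    return ResidueId(chain, int(match.group(1)), match.group(2))


def linear_index(residue_ids):
    """Map each ResidueId to its 0-based position in record order."""
    return {rid: i for i, rid in enumerate(residue_ids)}


chain_a = [parse_residue_id("A", s)
           for s in ["-2", "-1", "1", "2A", "2B", "2C"]]
index = linear_index(chain_a)
print(index[ResidueId("A", 2, "B")])  # prints 4

# A feature over several regions is a list of (start, end) pairs, each
# of which may sit on a different chain.  The positions are invented.
disulphide = {
    "type": "DISULFID",
    "segments": [
        (ResidueId("A", 1, ""), ResidueId("A", 1, "")),
        (ResidueId("B", 45, ""), ResidueId("B", 45, "")),
    ],
}
print(sorted({start.chain for start, _ in disulphide["segments"]}))
```

Keeping the author-assigned residue id as the key and deriving the
linear position from it means a client can still display "2B" while a
server stores plain integer coordinates.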

Beyond the technical details,

  Who are the test users?

  What's the reference platform?  Should there even be one?
There's a boatload of 3d structure viewers.  A decade ago
Steven Brenner proposed a generic format for selection +
annotation information.  Perhaps that's a better path?

  Is writeback needed?

					Andrew
					dalke at dalkescientific.com