[DAS2] structure DAS
Andrew Dalke
dalke at dalkescientific.com
Fri Jun 3 18:50:47 UTC 2005
I've been thinking more about the general idea of a structure DAS.
I think it would be good to involve someone with more recent
(and better) structure knowledge than I have. This may be
the woman from RCSB mentioned yesterday. Another idea is
Steven Brenner.
There are two main ways to think about proteins: sequence
and conformation.
The sequence model is similar to that used for DNA.
Sequences have residues arranged in a line, with positions
numbered consecutively.
The biggest database for this is SWISS-PROT. Here's
an example of its features:
FT DOMAIN 583 920 HECT.
FT REGION 515 571 PABP-like.
FT COMPBIAS 108 119 Asp/Glu-rich (acidic).
FT COMPBIAS 158 181 Pro-rich.
FT COMPBIAS 451 470 Arg/Glu-rich (mixed charge).
FT COMPBIAS 479 488 Arg/Asp-rich (mixed charge).
FT COMPBIAS 610 621 Asp/Glu-rich (acidic).
FT COMPBIAS 858 878 Pro-rich.
FT ACT_SITE 889 889 Glycyl thioester intermediate (By
FT similarity).
Each line gives a feature type, start/end positions, and a description.
I imagine there is an ontology for these but I haven't been
following that work.
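
To make the type/start/end/description layout concrete, here is a
minimal sketch in Python that reads lines like the ones above. The
names Feature and parse_ft_lines are hypothetical, made up for this
illustration. It assumes whitespace-separated fields as shown in this
message; real SWISS-PROT entries use fixed columns and can carry
uncertain positions such as "<1" or "?", which this sketch does not
handle.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    key: str          # feature type, e.g. DOMAIN
    start: int        # 1-based start position in the sequence
    end: int          # 1-based end position (inclusive)
    description: str


def parse_ft_lines(lines):
    """Parse whitespace-separated FT lines into Feature tuples."""
    features = []
    for line in lines:
        tokens = line.split()
        if not tokens or tokens[0] != "FT":
            continue
        # A new feature looks like: FT KEY START END description...
        is_new = (len(tokens) >= 4
                  and tokens[2].isdigit() and tokens[3].isdigit())
        if is_new:
            features.append(Feature(tokens[1], int(tokens[2]),
                                    int(tokens[3]), " ".join(tokens[4:])))
        elif features:
            # Anything else is treated as a continuation of the
            # previous feature's description.
            last = features[-1]
            extra = " ".join(tokens[1:])
            features[-1] = last._replace(
                description=(last.description + " " + extra).strip())
    return features


example = """\
FT DOMAIN 583 920 HECT.
FT ACT_SITE 889 889 Glycyl thioester intermediate (By
FT similarity).
""".splitlines()

features = parse_ft_lines(example)
for f in features:
    print(f.key, f.start, f.end, f.description)
```

The continuation check is a heuristic: a line counts as a new feature
only when its third and fourth fields are both integers.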
Structure is more complicated. The biggest data source
for this is the PDB. Things to worry about:
* a PDB record may contain aggregates of protein, DNA, lipids,
waters, ions, ligands, post-translational modifications and
other bits and pieces.
* the sequence listed for a chain may be different from the one
found by crystallography.
* residue numbers in the structure may not be consecutive. Eg,
in a chain the residue ids may be -2, -1, 1, 2A, 2B, 2C. The
numbering is often done to preserve residue identifiers across
homologous structures.
* some features are at the atomic level and not residue level.
For that matter, some people like things like "center of ring"
but I think we can ignore those. Others like "binding pocket"
but there's no good way to specify that.
* some residues have "alternate" conformations, eg, a side
chain that's believed to have two common orientations. I
don't think we need to worry about this.
* NMR structures (and others) may have multiple models.
I think we don't need to worry about this. All programs I
know of handle these as alternate conformations and have
no way to say a given feature is on only one of those
conformations.
* some features may be over several regions of a protein,
or across several different chains. Eg, a disulphide bond
between two different proteins or an indicator of a beta
barrel composed of multiple proteins.
* strange things, like a protein covalently bonded to a
piece of DNA. Those chemists are so whacky! Here's a
picture of one done in my old group
http://www.ks.uiuc.edu/Research/pro_DNA/hmgd/SDNA_t.gif
from
http://www.ks.uiuc.edu/Research/pro_DNA/hmgd/
I think it's okay to linearize these.
* crystal structures and symmetries. One example that
comes to mind is the virus structure I worked on where
a beta sheet went from one protein chain on the given
protomer to another protein chain on the next protomer
around the 5-fold symmetry axis. But the structure
record only contains a single protomer. I don't think
we need to worry about this because to the best of
my knowledge that information is not available in
any database; it's extracted by humans reading the
comments and associated papers.
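
Two of the points above, non-consecutive residue ids and features that
span several chains, can be sketched in Python as follows. This is only
an illustration under assumptions: the names ResidueId, Segment,
StructureFeature, parse_residue_id and linearize are hypothetical, the
chain B ids are made up, and "linearize" here just means position in
file order. Chain A uses the residue ids from the example above.

```python
import re
from typing import NamedTuple, Tuple


class ResidueId(NamedTuple):
    chain: str
    number: int           # may be negative or skip values
    insertion_code: str   # "" when absent, e.g. "A" in 2A


_RESID = re.compile(r"^(-?\d+)([A-Za-z]?)$")


def parse_residue_id(chain, text):
    """Split an id like '2A' or '-2' into number and insertion code."""
    m = _RESID.match(text)
    if m is None:
        raise ValueError(f"unrecognized residue id: {text!r}")
    return ResidueId(chain, int(m.group(1)), m.group(2))


def linearize(residue_ids):
    """Map each residue id to its 0-based position in file order."""
    return {rid: i for i, rid in enumerate(residue_ids)}


class Segment(NamedTuple):
    start: ResidueId
    end: ResidueId


class StructureFeature(NamedTuple):
    type: str
    segments: Tuple[Segment, ...]   # one entry per region or chain


chain_a = [parse_residue_id("A", t)
           for t in ["-2", "-1", "1", "2A", "2B", "2C"]]
chain_b = [parse_residue_id("B", t) for t in ["1", "2", "3"]]
index = linearize(chain_a + chain_b)

# A hypothetical disulphide bond between residue 2A of chain A
# and residue 2 of chain B, stored as two one-residue segments.
disulphide = StructureFeature("disulphide", (
    Segment(chain_a[3], chain_a[3]),
    Segment(chain_b[1], chain_b[1]),
))
linear_ranges = [(index[s.start], index[s.end])
                 for s in disulphide.segments]
print(linear_ranges)
```

Keeping the original (chain, number, insertion code) triple next to
the linear index means a client can still show the ids people expect
while using plain integer ranges for the feature positions.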
Beyond the technical details:
Who are the test users?
What's the reference platform? Should there even be one?
There's a boatload of 3d structure viewers. A decade ago
Steven Brenner proposed a generic format for selection +
annotation information. Perhaps that's a better path?
Is writeback needed?
Andrew
dalke at dalkescientific.com