Back (for now)

Thu, 3 Jul 1997 18:10:29 +0900 (JST)

Dear Steve,

  Thanks for your detailed thoughts about the relationships between
modules for 2D and 3D structure.  [My comments deal with your overall
design; I agree with all the issues, so I haven't repeated the text.]

  I like the general outline; I think I agree with you that 2D structure
is a module which should somehow be applicable to both 3D and 1D
structure.   One consideration is that every (modern) 3D protein structure
has a known 1D structure (i.e., sequence).   So, perhaps an easy way to
implement all of this would be that a 3D-structure has a 1D-structure,
and a 1D-structure has a 2D structure.   

  I use 'has a,' as one sort of relationship, though I haven't figured out
if that is best.  Comments appreciated!  One reason for this approach is
that 1D-structure and 2D-structure are both discrete and linear.
3D-structure is neither; any atom can be in any place (though there are
obviously some correlations), and the atomic geometry is not linear. So,
2D-structure more neatly maps onto 1D-structure; since we do need to link
the 1D and 3D structure, we might as well use that link to get to the 2D
as well.
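
  To make the 'has a' idea a bit more concrete, here is a rough sketch of
how the links might look.  The package names (Sketch::Seq1D,
Sketch::SecStruct2D, Sketch::Struct3D), the method names, and the example
sequence are all made up for illustration; none of them are existing
modules or real data.  It only shows one possible arrangement, in which
the 3D object reaches its 2D structure through the 1D link.

```perl
use strict;
use warnings;

# Hypothetical packages, for illustration only.
{
    package Sketch::SecStruct2D;
    # One secondary-structure code per residue, e.g. H (helix), E (strand), C (coil).
    sub new {
        my ($class, $codes) = @_;
        return bless { codes => $codes }, $class;
    }
    sub codes { $_[0]{codes} }
}
{
    package Sketch::Seq1D;
    sub new {
        my ($class, %args) = @_;
        return bless { seq => $args{seq}, sec_struct => undef }, $class;
    }
    sub str { $_[0]{seq} }
    # The 1D structure "has a" 2D structure.  Both are discrete and linear,
    # so we can insist that they have the same length.
    sub sec_struct {
        my ($self, $ss) = @_;
        if (defined $ss) {
            die "length mismatch\n"
                unless length($ss->codes) == length($self->{seq});
            $self->{sec_struct} = $ss;
        }
        return $self->{sec_struct};
    }
}
{
    package Sketch::Struct3D;
    # The 3D structure "has a" 1D structure (plus its jumble of coordinates).
    sub new {
        my ($class, %args) = @_;
        return bless { coords => $args{coords}, seq => $args{seq} }, $class;
    }
    sub seq { $_[0]{seq} }
    # Get to the 2D structure by following the 1D link.
    sub sec_struct { $_[0]{seq}->sec_struct }
}

my $seq = Sketch::Seq1D->new(seq => 'MKVLAAGIV');
$seq->sec_struct(Sketch::SecStruct2D->new('CHHHHHCEE'));
my $struct = Sketch::Struct3D->new(seq => $seq, coords => []);
print $struct->sec_struct->codes, "\n";
```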

  I like your thoughts about folds (e.g., 4-helix-bundle), as a
description of the 3D structure; I had not previously considered this.
However, these describe a domain as a whole rather than any particular
details of either the secondary or tertiary structure.  Perhaps we should
have a DomainDescription module which is sort of like the 2D-structure
module. Where 2D-structure contains secondary structure elements,
DomainDescriptions have folds. A tricky caveat here is that folds can be
discontinuous in sequence.
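
  One way the discontinuity might be handled is to let a domain carry a
list of residue ranges rather than a single start and end.  The sketch
below assumes a hypothetical Sketch::DomainDescription package; the
method names, the residue numbers, and the toy sequence are invented
purely for illustration.

```perl
use strict;
use warnings;

# Hypothetical package, for illustration only.
{
    package Sketch::DomainDescription;
    # fold:     a fold name
    # segments: list of [start, end] residue ranges (1-based, inclusive),
    #           so a domain may be discontinuous in sequence.
    sub new {
        my ($class, %args) = @_;
        return bless { fold => $args{fold}, segments => $args{segments} }, $class;
    }
    sub fold { $_[0]{fold} }
    # True if residue $pos falls inside any segment of the domain.
    sub contains {
        my ($self, $pos) = @_;
        for my $seg (@{ $self->{segments} }) {
            return 1 if $pos >= $seg->[0] && $pos <= $seg->[1];
        }
        return 0;
    }
    # Concatenate the pieces of $seq covered by the domain's segments.
    sub subseq {
        my ($self, $seq) = @_;
        return join '',
            map { substr($seq, $_->[0] - 1, $_->[1] - $_->[0] + 1) }
            @{ $self->{segments} };
    }
}

my $domain = Sketch::DomainDescription->new(
    fold     => '4-helix-bundle',
    segments => [ [1, 5], [11, 15] ],   # made-up ranges
);
my $toy_seq = 'AAAAACCCCCGGGGGTTTTT';   # made-up sequence
print $domain->fold, ': ', $domain->subseq($toy_seq), "\n";
```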

> However, there's one case where I can see some overlap between 3D and 2D 
> structural issues: circular dichroism (CD) experiments. Using CD you can 
> estimate the overall percentage of helix, sheet, and coil in a protein
...

I think that these data are not archived anywhere and are basically not
much trusted.  They can be useful and we should keep the possibility of
using them open.  However, I don't think that they are of sufficient
import that they should play a large role in building the hierarchy.

> One more point: my hypothetical Bio::Struct.pm module doesn't know 
> anything about 3D structures but delegates this task to Bio::Struct::PDB.pm. 
> Similarly, there could be another module that handles strictly 2D issues. 

Naming is more of a philosophical and political question than a technical
one.  On these grounds, I think that it is important that the object which
knows about coordinates be Bio::Struct.  The reason is that the thing most
people will want to do most often is parse in a PDB file and do something
with it -- this "jumble of coordinates" will be the "currency" for
structures just as "Bio::Seq" will be the corresponding one for sequences.

To reduce the learning curve and to make things appear as simple as possible,
I think that having a 'Bio::Seq' and a 'Bio::Struct' which are
more-or-less capable of appearing to do everything necessary is important.

> I decided to go ahead and create a scop module it since I knew I 
> would be doing alot of work with scop data.  scop_dict.cf is a little 
> dictionary I created for converting between class/fold number to class/fold 
> name. You probably already have such a thing, but it was easy enough to 
> create. Here's a snippet: 

I see.  We do have a similar type of thing which uses cdb files.  (It's
just a set of functions.  For various historical and performance reasons,
scop is not very OO).  As an aside, cdb files are great!

> > I have no objection to this, but curious to know why you want to
> > be able to do slices for revcom, etc.
> 
> I needed to process sequences for all genes on a yeast chromosome. It 
> seemed easiest to create a big PreSeq object for the chromosomal sequence 
> and then extract sub-sequences for each gene as needed. Since some genes 
> are on the complementary strand, I needed revcom() to work like str(). 
> See, for example:
> http://genome-www.stanford.edu/~sac/perlOOP/bioperl/lib/Bio/Gene/Seq.pm

Ok; this makes sense.  I had forgotten about revcom's current
implementation.  One idea was that it would modify the existing object;
another idea was that it would return a modified object.  Right now it
seems to be roughly in-between. :)

My suggested modification (probably can't show up until Bio::Seq) would be
for revcom to return an object with the required modification.  Probably
my preferred calling sequence would be:

$mybackgene = new Bio::PreSeq ($mychromosome->str($end,$beg));
$mygene = $mybackgene->revcom();
print $mygene->str(), "\n";

Or, maybe we should add another method like getseq to return a sequence
object of a slice:

$mybackgene = $mychromosome->get_seq_obj($end,$beg);
   # ick!  get_seq_obj is a horrible method name!    
$mygene  = $mybackgene->revcom();
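
To show the "return a modified object" behaviour I have in mind, here is
a small self-contained sketch.  Sketch::PreSeq is a hypothetical stand-in
and not the real PreSeq module; the slice() method name is just a
placeholder, and the coordinates are assumed to be 1-based and inclusive.
The point is only that both slice() and revcom() hand back new objects
and leave the original untouched.

```perl
use strict;
use warnings;

# Hypothetical stand-in class, for illustration only.
{
    package Sketch::PreSeq;
    sub new {
        my ($class, $seq) = @_;
        return bless { seq => $seq }, $class;
    }
    sub str { $_[0]{seq} }
    # Return a NEW object holding residues $start..$end (1-based, inclusive).
    sub slice {
        my ($self, $start, $end) = @_;
        return ref($self)->new(substr($self->{seq}, $start - 1, $end - $start + 1));
    }
    # Return a NEW reverse-complemented object; $self is not modified.
    sub revcom {
        my ($self) = @_;
        my $rc = reverse $self->{seq};   # scalar context: reverses the string
        $rc =~ tr/ACGTacgt/TGCAtgca/;    # complement each base
        return ref($self)->new($rc);
    }
}

my $chromosome = Sketch::PreSeq->new('AAACCCGGGTTT');   # toy sequence
my $backgene   = $chromosome->slice(2, 6);
my $gene       = $backgene->revcom;
print $gene->str, "\n";
```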

Steve