Bioperl: UPDB

Fri, 28 Aug 1998 23:02:34 -0700

Hello,

I've been working on what I hope to be my last PDB parser.
(I've been told by several people that I'm ambitious in
trying to do so :)  I counted and I think I've written
5 or so incomplete PDB parsers over the few years, and that
gets boring real fast.

So I've written a Python script that reads the PDB format
documentation and generates code for a given language
to parse and/or generate column delimited record format
used by the PDB.  It is called "UPDB".

Since the format documentation is used as input, this gives
users the ability to take several PDB format definitions
(eg, PDB version 1, PDB version 2.1, and XPLOR extensions
applied to version 1) create parser code.  By writing a
driver that recognizes the differences between them, you
can easily write a parser that handles all the different
formats.

Of course, this only addresses the syntax issues.  The
PDB has a large number of semantic problems.

Now, I wrote the code in Python and got the first alpha
version working yesterday.  (BTW, during development I
found about 8 or so problems with the PDB spec itself.)
My code is designed to handle outputs for multiple languages,
or multiple styles for the same language, so in the last
couple hours I hacked in a module to generate a Perl parser.

That version is not as complete as the Python version.
In Python I generate a class for each style of parser and
create a master parser class that knows how to recognize the
input line layout and forward the request to the right subclass.
This is a class because it contains state -- once the format
for a given line is known, it uses that function for all
subsequent usages (since the format doesn't change inside
a file).

The problem is I don't know how to do Perl classes.  I've
tried at other times, but got frustrated and ended up faking
classes as modules with state passed via anonymous hashes
or list of hashes.

That's why I'm here.  Would anyone here like to work with me
to finish off the perl generation code?  The end result
should be something like "use UPDB::Version2_1" or
"use UPDB::Version1" or "use UPDB::XPLOR" or whatever
interfaces you want.

Also, while I'm here, can anyone give me information about
common PDB extensions?  I know about four character residue
names and XPLOR-style extended atom serial ranges.  I know
there are others (for example, I've seen COLOR cards
defined by UCSF, I think).

Oh, and I forgot to mention, this code will be made freely
available.

						Andrew Dalke
						dalke@bioreason.com

P.S.
  Here's my test script (there may be problems with 80 column mailers):
  ===========
require "test.pl";
sub dump_dict {
        my($ref) = $_[0];
        my($key, $val);
        print "Dumping from $ref->{'type'}:\n";
        foreach $key (keys %{$ref}) {
          $val = $ref -> {$key};
          print "  $key = '$val'\n";
        }
}
$pdb = <<EOF;
SEQADV 1AA1 KCX L  201  SWS  P00875    LYS   201 MODIFIED RESIDUE               
DBREF  1AA1 I    1   123  SWS    P10795   RBS1_ARATH      56    178             
SEQRES   1 L  475  MET SER PRO GLN THR GLU THR LYS ALA SER VAL GLY PHE          
MODRES 1AA1 KCX L  201  LYS  LYSINE NZ-CARBOXYLIC ACID                          
HET     MG  A 476       1                                                       
FORMUL   1  KCX    C7 H14 N2 O4                                                 
HELIX    8   8 ILE L  155  LEU L  162  1                                   8    
SHEET    2  HC 2 LEU H 290  HIS H 294  1  N  LEU H 290   O  VAL H 265           
LINK         N   KCX E 201                 C   THR E 200                        
CISPEP   2 LYS B  175    PRO B  176          0         0.02                     
CRYST1  157.600  158.700  203.300  90.00  90.00  90.00 C 2 2 21     32          
SCALE1      0.006345  0.000000  0.000000        0.00000                         
MTRIX3   3  0.000395  0.002990  0.999995       -0.11720    1                    
ATOM      9  NZ  LYS L  21     -53.630  60.423  72.753  1.00 55.80           N  
HETATM 1422  OX1 KCX L 201     -33.727  34.519  32.546  1.00 29.82           O  
CONECT1870218701187031870418705                                                 
END                                                                             
EOF

@pdb = split("\n", $pdb);
foreach $line (@pdb) {
  $info = &pdb_unpack($line);
  &dump_dict($info);
}
==========================
Here's part of the output
==========================
Dumping from HETATM:
  serial = '1422'
  name = ' OX1'
  type = 'HETATM'
  iCode = ' '
  occupancy = '1'
  charge = '  '
  element = ' O'
  resName = 'KCX'
  segID = '    '
  chainID = 'L'
  resSeq = '201'
  tempFactor = '29.82'
  x = '-33.727'
  y = '34.519'
  z = '32.546'
  altLoc = ' '
Dumping from CONECT:
  serial = 'ARRAY(0x1010b38c)'
  type = 'CONECT'

==========================
Here's part of the autogenerated code (currently has poor indentation,
and I don't know how to convert to float other than adding 0.0)
==========================

sub unpack_ATOM {
        my($line)=$_[0];

my(%ret_data);
$ret_data{'serial'} = &from_serial(substr($line, 6, 5));
$ret_data{'name'} = substr($line, 12, 4);
$ret_data{'altLoc'} = substr($line, 16, 1);
$ret_data{'resName'} = &strip(substr($line, 17, 4));
$ret_data{'chainID'} = substr($line, 21, 1);
$ret_data{'resSeq'} = int(substr($line, 22, 4));
$ret_data{'iCode'} = substr($line, 26, 1);
$ret_data{'x'} = (substr($line, 30, 8) + 0.0);
$ret_data{'y'} = (substr($line, 38, 8) + 0.0);
$ret_data{'z'} = (substr($line, 46, 8) + 0.0);
$ret_data{'occupancy'} = (substr($line, 54, 6) + 0.0);
$ret_data{'tempFactor'} = (substr($line, 60, 6) + 0.0);
$ret_data{'segID'} = substr($line, 72, 4);
$ret_data{'element'} = substr($line, 76, 2);
$ret_data{'charge'} = substr($line, 78, 2);
return \%ret_data;
}
=========== Bioperl Project Mailing List Message Footer =======
Project URL: http://www.techfak.uni-bielefeld.de/bcd/Perl/Bio/
For info about how to (un)subscribe, where messages are archived, etc:
http://www.techfak.uni-bielefeld.de/bcd/Perl/Bio/vsns-bcd-perl.html
====================================================================