Bioperl: UPDB
Andrew Dalke
dalke@bioreason.com
Fri, 28 Aug 1998 23:02:34 -0700
Hello,
I've been working on what I hope to be my last PDB parser.
(I've been told by several people that I'm ambitious in
trying to do so :) I counted and I think I've written
5 or so incomplete PDB parsers over the few years, and that
gets boring real fast.
So I've written a Python script that reads the PDB format
documentation and generates code for a given language
to parse and/or generate column delimited record format
used by the PDB. It is called "UPDB".
Since the format documentation is used as input, this gives
users the ability to take several PDB format definitions
(eg, PDB version 1, PDB version 2.1, and XPLOR extensions
applied to version 1) create parser code. By writing a
driver that recognizes the differences between them, you
can easily write a parser that handles all the different
formats.
Of course, this only addresses the syntax issues. The
PDB has a large number of semantic problems.
Now, I wrote the code in Python and got the first alpha
version working yesterday. (BTW, during development I
found about 8 or so problems with the PDB spec itself.)
My code is designed to handle outputs for multiple languages,
or multiple styles for the same language, so in the last
couple hours I hacked in a module to generate a Perl parser.
That version is not as complete as the Python version.
In Python I generate a class for each style of parser and
create a master parser class that knows how to recognize the
input line layout and forward the request to the right subclass.
This is a class because it contains state -- once the format
for a given line is known, it uses that function for all
subsequent usages (since the format doesn't change inside
a file).
The problem is I don't know how to do Perl classes. I've
tried at other times, but got frustrated and ended up faking
classes as modules with state passed via anonymous hashes
or list of hashes.
That's why I'm here. Would anyone here like to work with me
to finish off the perl generation code? The end result
should be something like "use UPDB::Version2_1" or
"use UPDB::Version1" or "use UPDB::XPLOR" or whatever
interfaces you want.
Also, while I'm here, can anyone give me information about
common PDB extensions? I know about four character residue
names and XPLOR-style extended atom serial ranges. I know
there are others (for example, I've seen COLOR cards
defined by UCSF, I think).
Oh, and I forgot to mention, this code will be made freely
available.
Andrew Dalke
dalke@bioreason.com
P.S.
Here's my test script (there may be problems with 80 column mailers):
===========
require "test.pl";
sub dump_dict {
my($ref) = $_[0];
my($key, $val);
print "Dumping from $ref->{'type'}:\n";
foreach $key (keys %{$ref}) {
$val = $ref -> {$key};
print " $key = '$val'\n";
}
}
$pdb = <<EOF;
SEQADV 1AA1 KCX L 201 SWS P00875 LYS 201 MODIFIED RESIDUE
DBREF 1AA1 I 1 123 SWS P10795 RBS1_ARATH 56 178
SEQRES 1 L 475 MET SER PRO GLN THR GLU THR LYS ALA SER VAL GLY PHE
MODRES 1AA1 KCX L 201 LYS LYSINE NZ-CARBOXYLIC ACID
HET MG A 476 1
FORMUL 1 KCX C7 H14 N2 O4
HELIX 8 8 ILE L 155 LEU L 162 1 8
SHEET 2 HC 2 LEU H 290 HIS H 294 1 N LEU H 290 O VAL H 265
LINK N KCX E 201 C THR E 200
CISPEP 2 LYS B 175 PRO B 176 0 0.02
CRYST1 157.600 158.700 203.300 90.00 90.00 90.00 C 2 2 21 32
SCALE1 0.006345 0.000000 0.000000 0.00000
MTRIX3 3 0.000395 0.002990 0.999995 -0.11720 1
ATOM 9 NZ LYS L 21 -53.630 60.423 72.753 1.00 55.80 N
HETATM 1422 OX1 KCX L 201 -33.727 34.519 32.546 1.00 29.82 O
CONECT1870218701187031870418705
END
EOF
@pdb = split("\n", $pdb);
foreach $line (@pdb) {
$info = &pdb_unpack($line);
&dump_dict($info);
}
==========================
Here's part of the output
==========================
Dumping from HETATM:
serial = '1422'
name = ' OX1'
type = 'HETATM'
iCode = ' '
occupancy = '1'
charge = ' '
element = ' O'
resName = 'KCX'
segID = ' '
chainID = 'L'
resSeq = '201'
tempFactor = '29.82'
x = '-33.727'
y = '34.519'
z = '32.546'
altLoc = ' '
Dumping from CONECT:
serial = 'ARRAY(0x1010b38c)'
type = 'CONECT'
==========================
Here's part of the autogenerated code (currently has poor indentation,
and I don't know how to convert to float other than adding 0.0)
==========================
sub unpack_ATOM {
my($line)=$_[0];
my(%ret_data);
$ret_data{'serial'} = &from_serial(substr($line, 6, 5));
$ret_data{'name'} = substr($line, 12, 4);
$ret_data{'altLoc'} = substr($line, 16, 1);
$ret_data{'resName'} = &strip(substr($line, 17, 4));
$ret_data{'chainID'} = substr($line, 21, 1);
$ret_data{'resSeq'} = int(substr($line, 22, 4));
$ret_data{'iCode'} = substr($line, 26, 1);
$ret_data{'x'} = (substr($line, 30, 8) + 0.0);
$ret_data{'y'} = (substr($line, 38, 8) + 0.0);
$ret_data{'z'} = (substr($line, 46, 8) + 0.0);
$ret_data{'occupancy'} = (substr($line, 54, 6) + 0.0);
$ret_data{'tempFactor'} = (substr($line, 60, 6) + 0.0);
$ret_data{'segID'} = substr($line, 72, 4);
$ret_data{'element'} = substr($line, 76, 2);
$ret_data{'charge'} = substr($line, 78, 2);
return \%ret_data;
}
=========== Bioperl Project Mailing List Message Footer =======
Project URL: http://www.techfak.uni-bielefeld.de/bcd/Perl/Bio/
For info about how to (un)subscribe, where messages are archived, etc:
http://www.techfak.uni-bielefeld.de/bcd/Perl/Bio/vsns-bcd-perl.html
====================================================================