[Bioperl-l] PDB parser

Wed, 28 Nov 2001 14:45:50 +0100

I've committed the first public release of a PDB parser to bioperl-live.
It has already successfully parsed 10% of all PDB entries, so I'm fairly
confident it is usable and stable.

At the moment it cannot write PDB files, but much of what is needed to
make this possible is already in place (see below). It should be noted that
although every line is read, not every record is parsed (a record in
PDB-speak consists of one or more lines beginning with the same record
name, e.g. COMPND).

In the header section (everything before the coordinates) only the JRNL,
REMARK 1 and DBLINK records are fully parsed and stored as Annotation objects
of the corresponding type. The other records are read and stored as
SimpleValue Annotation objects, with the record name as the annotation key.
The record name itself is not stored in the value; the value begins at the
first non-blank character and is space-padded (the PDB format is column
based). Multiple lines are concatenated.

  use Bio::Structure::IO;
  my $structio = Bio::Structure::IO->new(-file => $filename, -format => "PDB");
  my $struc = $structio->next_structure;
  my ($ann) = $struc->annotation->get_Annotations("compnd");
  # $ann is an Annotation object containing the data from the COMPND record
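
To illustrate what "space-padded" and "concatenated" mean for a multi-line
record, here is a minimal sketch in plain Perl. It is not the code in
bioperl-live: concat_record is a hypothetical helper, the sample lines are
made up, and taking columns 11-70 as the text field is an assumption based
on the usual layout of records such as COMPND.

```perl
use strict;
use warnings;

# Hypothetical helper, not the bioperl-live code: join the text field of
# every line of one record into a single space-padded string.
sub concat_record {
    my ($record_name, @lines) = @_;
    my $value = '';
    for my $line (@lines) {
        # Pad to 80 columns so that short lines still have all fields.
        my $padded = sprintf '%-80s', $line;
        my $name = substr $padded, 0, 6;      # columns 1-6: record name
        $name =~ s/\s+\z//;
        next unless $name eq $record_name;
        $value .= substr $padded, 10, 60;     # columns 11-70 (assumed)
    }
    return $value;
}

# Made-up header lines, for illustration only.
my @header = (
    'COMPND    MOL_ID: 1;',
    'COMPND   2 MOLECULE: EXAMPLE PROTEIN;',
    'SOURCE    MOL_ID: 1;',
);
my $compnd = concat_record('COMPND', @header);
print "[$compnd]\n";
```

Because every line contributes a fixed-width field, the padding between the
two COMPND lines survives in the concatenated value; callers that want a
compact string would have to collapse the runs of blanks themselves.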

The coordinate section (MODEL, ATOM, HETATM, ANISOU, ...) is fully parsed,
and methods for accessing all data are provided:

  # continuing from the previous example
  for my $model ($struc->get_models) {
    for my $chain ($struc->get_chains($model)) {
      print "chain ", $chain->id, "\n";
      for my $res ($struc->get_residues($chain)) {
        print "\tresidue ", $res->id, "\n";
        for my $atom ($struc->get_atoms($res)) {
          # do something with Atom object
          my ($x, $y, $z) = $atom->xyz;
        }
      }
    }
  }
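
For readers unfamiliar with the column-based layout of the coordinate
section, the following sketch pulls the fields out of a single ATOM line
with unpack. It is only an illustration, not the parser's own code:
parse_atom_line and the hash keys are hypothetical names, the sample line is
made up, and the column positions are those of the ATOM record in the PDB
format description.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of bioperl-live: split one ATOM/HETATM
# line into named fields by column position.
sub parse_atom_line {
    my ($line) = @_;
    # Pad to 80 columns so unpack never reads past the end of a short line.
    my $padded = sprintf '%-80s', $line;
    # Columns: 1-6 record, 7-11 serial, 13-16 atom name, 17 altLoc,
    # 18-20 residue name, 22 chain, 23-26 residue number, 27 iCode,
    # 31-38 x, 39-46 y, 47-54 z, 55-60 occupancy, 61-66 temperature factor.
    my @fields = unpack 'A6 A5 x1 A4 A1 A3 x1 A1 A4 A1 x3 A8 A8 A8 A6 A6',
        $padded;
    s/^\s+// for @fields;    # 'A' strips trailing blanks only
    my %atom;
    @atom{qw(record serial name altloc resname chain resseq icode
             x y z occupancy tempfactor)} = @fields;
    return \%atom;
}

# Made-up ATOM line, for illustration only.
my $atom = parse_atom_line(
    'ATOM      1  N   VAL A  42      11.104   6.134  -6.504  1.00  0.00'
);
printf "%s %s%s %s (%.3f, %.3f, %.3f)\n",
    @{$atom}{qw(resname chain resseq name x y z)};
```

The altloc field is the alternate location indicator; it is blank for most
atoms.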

Things that will/might be done in the (near) future:

- make Bio::Structure::Entry and Chain Bio::Seq compliant (i.e. being
  able to get the sequence via $struc->seq and $chain->seq)
- improve parsing speed: as a lot of objects are created under the hood,
  it is not surprising that parsing a big PDB entry can take time (up to 
  one minute). 
- implement write_structure
- better handling of alternate locations: at this time only the first
  alternate is taken, the rest is ignored.
- change atom names to IUPAC conventions: who is interested?
- implement additional methods: get_all_atoms_within_x_A(), phi(),
  psi(), get_residue_by_id("Val-42"), ...
- write parsers for specific records if the need arises
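
As a rough idea of what a method like get_all_atoms_within_x_A() from the
list above could do, here is a sketch of a distance filter in plain Perl.
None of this exists in bioperl-live: atoms_within is a hypothetical name,
atoms are represented as plain hashes with made-up coordinates, and with
real Atom objects the coordinates would come from $atom->xyz instead.

```perl
use strict;
use warnings;

# Hypothetical helper: return the atoms lying within $cutoff Angstrom of
# $center. Each atom is a plain hash whose 'xyz' key holds [x, y, z].
sub atoms_within {
    my ($center, $cutoff, @atoms) = @_;
    my $cutoff_sq = $cutoff ** 2;    # compare squared distances, no sqrt
    my @near;
    for my $atom (@atoms) {
        my $dist_sq = 0;
        for my $i (0 .. 2) {
            $dist_sq += ($atom->{xyz}[$i] - $center->[$i]) ** 2;
        }
        push @near, $atom if $dist_sq <= $cutoff_sq;
    }
    return @near;
}

# Made-up coordinates, for illustration only.
my @atoms = (
    { name => 'N',  xyz => [0.0, 0.0, 0.0] },
    { name => 'CA', xyz => [1.5, 0.0, 0.0] },
    { name => 'CB', xyz => [6.0, 0.0, 0.0] },
);
my @near = atoms_within([0, 0, 0], 2.0, @atoms);
print join(' ', map { $_->{name} } @near), "\n";
```

This checks every atom against the centre, which is fine for a single
query; repeated queries on a large entry would probably want some spatial
indexing.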

All comments and criticism are welcome.

Kris,
-- 
Kris Boulez 				Tel: +32-9-241.11.00
AlgoNomics NV 				Fax: +32-9-241.11.02
Technologiepark 4 			email: kris.boulez@algonomics.com
B 9052 Zwijnaarde 			http://www.algonomics.com/