[Bioperl-l] Parsing PDB entries in BioPerl

Tue, 13 Nov 2001 12:53:08 -0600 (CST)

Hi Kris,

> The idea is to parse every line in the entry and to have access to all
> the data via some Bio:: object. The work on the SeqIO parser (Bio::SeqIO::pdb)
> is progressing nicely.

Let me start by saying that I don't have much experience with Perl.
However, I have seen lots of PDB files and the kinds of inconsistencies in
the format (some of them biologically relevant). If possible, I would
recommend that you process the equivalent CIF file instead of the PDB file.
You will find the CIF file much easier to handle, and of course you will
be able to get all the information that is present in the PDB file.

> 
> For the moment I'm working on parsing all the different 'records' (PDBspeak
> for different lines) and not so much on how to store the info in a Bio:: 
> object (references are already stored in Bio::Annotation::Reference objects).
> The moment to start thinking about 'how' to store 'what' inside 'which' 
> Bio::* object has arrived. 
> 
> My first thought was to inherit from a Bio::Seq object, but this does
> not seem to be the right approach
>   - which sequence to store (the one from Swiss-Prot)

The one in the SEQRES records.
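
A minimal sketch of pulling that sequence out, written in Python purely for
illustration. The function name seqres_by_chain is made up, and the column
positions assume well-formed fixed-width SEQRES records (chain ID in column
12, residue names in columns 20-70):

```python
from collections import defaultdict


def seqres_by_chain(pdb_lines):
    """Collect the three-letter residue names of each chain from SEQRES records."""
    chains = defaultdict(list)
    for line in pdb_lines:
        if line.startswith("SEQRES"):
            chain_id = line[11]                   # column 12: chain identifier
            residues = line[19:70].split()        # columns 20-70: residue names
            chains[chain_id].extend(residues)
    return dict(chains)


# Two invented SEQRES records, one per chain.
example = [
    "SEQRES   1 A    5  MET LYS THR ALA TYR",
    "SEQRES   1 B    3  GLY SER HIS",
]
print(seqres_by_chain(example))
```

Keeping one list per chain also covers entries with several chains.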

>   - not every residue has coordinates (C,N terminal)
Yes. The N and C termini and disordered loops, to name a few. Sometimes the
side-chain coordinates are missing, and in some PDB files a residue whose
side chain was not seen in the electron density (if it is an X-ray
structure) is modelled as an alanine. In some files it is labelled ALA; in
others it keeps its real residue name but has coordinates only for the
main chain and maybe the CB.
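
The second case can be spotted with a rough check like the one below. This
is only a sketch in Python: the function name truncated_residues is
hypothetical, the example coordinates are invented, and it assumes a file
without hydrogen atoms. Residues that were renamed to ALA cannot be
detected this way; they only show up when the ATOM records are compared
with the SEQRES sequence.

```python
from collections import defaultdict

# Atom names that belong to the main chain, plus CB and the terminal OXT.
BACKBONE_PLUS_CB = {"N", "CA", "C", "O", "CB", "OXT"}


def truncated_residues(pdb_lines):
    """Return (chain, residue number, residue name) for residues that keep
    their real name but have no side-chain atoms beyond CB."""
    atoms = defaultdict(set)
    for line in pdb_lines:
        if line.startswith("ATOM"):
            # columns 22, 23-27 and 18-20: chain, residue number (+ insertion code), name
            key = (line[21], line[22:27].strip(), line[17:20].strip())
            atoms[key].add(line[12:16].strip())   # columns 13-16: atom name
    return [
        key
        for key, names in atoms.items()
        if key[2] not in ("GLY", "ALA") and names <= BACKBONE_PLUS_CB
    ]


# Invented records: LYS 10 stops at CB, SER 11 is complete.
example = [
    "ATOM      1  N   LYS A  10      11.104   6.134  -6.504",
    "ATOM      2  CA  LYS A  10      11.639   6.071  -5.147",
    "ATOM      3  C   LYS A  10      13.159   6.153  -5.204",
    "ATOM      4  O   LYS A  10      13.737   6.981  -5.910",
    "ATOM      5  CB  LYS A  10      11.055   7.190  -4.280",
    "ATOM      6  N   SER A  11      13.801   5.280  -4.430",
    "ATOM      7  CA  SER A  11      15.254   5.240  -4.380",
    "ATOM      8  C   SER A  11      15.790   4.010  -5.100",
    "ATOM      9  O   SER A  11      15.100   3.000  -5.200",
    "ATOM     10  CB  SER A  11      15.740   5.250  -2.930",
    "ATOM     11  OG  SER A  11      15.300   6.420  -2.260",
]
print(truncated_residues(example))
```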

>   - PDB entries can consist of multiple 'chains' (i.e. a complex of two
>     proteins)

The ribosome structure is a good (if overkill) example of this (PDB ID:
1jj2).

>   - how to handle post-translational modifications

Generally, non-standard amino acids and post-translational modifications
are marked as HETATM records.
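
As a rough illustration (Python, with a made-up function name
hetero_residues and invented example records), the HETATM residues of an
entry could be listed like this. Note that waters and ligands are also
written as HETATM, so the result needs filtering before it says anything
about modified residues.

```python
def hetero_residues(pdb_lines):
    """Return the distinct (chain, residue number, residue name) triples
    found in HETATM records, in file order."""
    seen = []
    for line in pdb_lines:
        if line.startswith("HETATM"):
            # columns 22, 23-27 and 18-20: chain, residue number (+ insertion code), name
            key = (line[21], line[22:27].strip(), line[17:20].strip())
            if key not in seen:
                seen.append(key)
    return seen


# Invented records: a selenomethionine (MSE) and a water (HOH).
example = [
    "HETATM  101  N   MSE A  25",
    "HETATM  102 SE   MSE A  25",
    "HETATM  500  O   HOH A 301",
]
all_het = hetero_residues(example)
non_water = [res for res in all_het if res[2] != "HOH"]
print(all_het)
print(non_water)
```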

>   - there is no easy access to the data that makes PDB special (x,y,z
>     coordinates, ...)

The CIF file would be the way to go for this.

>   - how to handle 'models' (structures determined by NMR, do not consist
>     of one, but multiple entries).

NMR structures, or other structures that contain multiple models, generally
have the models delimited by MODEL and ENDMDL records:
MODEL      1
... coordinates
ENDMDL
MODEL      2
...  coordinates
ENDMDL

If there is only one model (e.g. an NMR minimised structure), or if the
structure was solved by X-ray crystallography, there are generally no
MODEL/ENDMDL records and the PDB file simply ends with the END record.
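
A sketch of how a parser might handle both cases, again in Python and only
as an illustration. The function name split_models and the example
coordinates are invented, and treating a file without MODEL records as a
single model 1 is an assumption of this sketch. The model number is read by
splitting on whitespace rather than from fixed columns, so the spacing
after MODEL does not matter.

```python
def split_models(pdb_lines):
    """Group ATOM/HETATM coordinates by model number.

    Files without MODEL/ENDMDL records end up as a single model 1.
    """
    models = {}
    current = 1
    for line in pdb_lines:
        record = line[:6].strip()
        if record == "MODEL":
            current = int(line.split()[1])        # model serial number
        elif record in ("ATOM", "HETATM"):
            # columns 31-38, 39-46, 47-54: x, y, z
            xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            models.setdefault(current, []).append(xyz)
        elif record == "END":
            break
    return models


# Invented two-model entry with one atom per model.
nmr_example = [
    "MODEL        1",
    "ATOM      1  N   MET A   1       1.000   2.000   3.000",
    "ENDMDL",
    "MODEL        2",
    "ATOM      1  N   MET A   1       1.500   2.500   3.500",
    "ENDMDL",
    "END",
]
# The same atom in a file without MODEL/ENDMDL records.
xray_example = [
    "ATOM      1  N   MET A   1       1.000   2.000   3.000",
    "END",
]
print(split_models(nmr_example))
print(split_models(xray_example))
```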
> 
> This suggests that a new type of object might be needed. To start
> thinking about this I think it might be good to think about how the user
> might use this object (i.e. 'which questions would you ask?'). I would
> therefore like to ask you which data in a PDB entry you're
> typically interested in and which questions you want to ask of such an
> object.

Well, this is a tough question. I would say everything in the PDB file is
important; again, it depends on the nature of the particular problem.

krishna

*******************************************
Sri Krishna S.
Postdoctoral Researcher / Department of Biochemistry
U.T. Southwestern Medical Center at Dallas
5323 Harry Hines Blvd., Dallas, TX 75390-9038
Tel:(214)648-7119 (Office)     (214)772-9439 (Home) 
Fax:(214)648-9099
krishna@chop.swmed.edu
*******************************************