[Bioperl-l] PDB sequence from ATOM records

Mon Jul 5 06:35:08 EDT 2004

> Hi,
>
> I need to be able to map some protein sequence alignment information
> onto a protein structure. To do this I need to get the sequence from
> the ATOM records, since the SEQRES sequence is often not exactly the
> same.
>
> So I'd like to change Bio::Structure::Residue a little:
>
> Amino-acid type and residue number are currently contained in one
> value, residue->id. I would like to separate them into two,
> residue->type and residue->num. Then, for backwards compatibility,
> construct residue->id from these each time it is required (or store
> it as well if that is better?). residue->type should be able to
> return the one-letter code as well as the three-letter code.
>
> And then have a method called something like
> Bio::Structure::Entry->atom_seq that would return a Bio::PrimarySeq
> object constructed from the one-letter codes of the residues of a
> particular chain.
>
> Any comments please... Thanks.
>
> Matthew
>
> P.S. Sorry if this message is a repeat - our email server went down
> as I was sending it the first time.
>
> -- 
> Matthew Betts, mailto:matthew.betts at ii.uib.no
> Phone: (+47) 55 58 40 22, Fax: (+47) 55 58 42 95
> CBU, BCCS, HiB, UNIFOB / Universitetet i Bergen
> Thormøhlensgt. 55, N-5008 Bergen, Norway
>

Hi,

I added something similar to my local copy of the BioPerl structure 
modules a while ago (I'm not sure who, if anyone, is currently 
maintaining Bio::Structure::*). I can't find my implementation, since 
I don't really use Bio::Structure::* anymore, but if I do find it I'll 
forward it to you. It's definitely a useful method to have: as you 
say, SEQRES can be quite different from what you actually see in the 
ATOM records. It's simple enough to implement, but keep in mind the 
following gotchas:

Gaps in the crystal structure: You'll have to pad with 'X's where the 
ATOM records are missing. E.g. if you have residues GLY40, GLU41, 
CYS44, LYS45, you need to make the sequence GEXXCK.

Insertion codes: PDB files can have residue 'sequences' like 
40, 41A, 41B, 42. You need to be careful not to just look at the 
number, otherwise 41A and 41B will look like the same residue and you 
will miss one of them.

Strange starting points: There are PDB files whose residue numbering 
starts in the negative, e.g. -5, -4, -3, -2, -1, 0, 1. Make sure your 
code can deal with this.
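
To make the three gotchas concrete, here is a rough sketch of the 
logic. It is plain Python rather than BioPerl, so that it does not 
pretend to be any existing Bio::Structure API. The function name, the 
(resname, resseq, icode) tuple layout and the choice to write unknown 
residue names as 'X' are all my own assumptions for illustration:

```python
# Hypothetical helper, not part of BioPerl. Input: one
# (resname, resseq, icode) tuple per ATOM line, in file order.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def atom_seq(residues, gap_char="X"):
    """Build a one-letter sequence for a single chain."""
    seq = []
    seen = set()      # (resseq, icode) pairs already emitted
    prev_num = None   # number of the previous residue, may be negative
    for resname, resseq, icode in residues:
        key = (resseq, icode)
        if key in seen:
            continue  # many ATOM lines belong to the same residue
        seen.add(key)
        # Pad only when the number jumps by more than one. Comparing
        # against None (not 0) keeps negative start points working.
        if prev_num is not None and resseq > prev_num + 1:
            seq.append(gap_char * (resseq - prev_num - 1))
        # Assumption: non-standard residue names are written as the
        # gap character too, so they cannot be told apart from gaps.
        seq.append(THREE_TO_ONE.get(resname, gap_char))
        prev_num = resseq
    return "".join(seq)


# Gap example from above: residues 42 and 43 are missing.
gapped = [("GLY", 40, ""), ("GLU", 41, ""), ("CYS", 44, ""), ("LYS", 45, "")]
print(atom_seq(gapped))  # GEXXCK
```

Because residues are keyed on number plus insertion code, 41A and 41B 
come out as two residues, and a chain numbered -2, -1, 0, 1 produces 
no padding.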

The only other comment I would make is that I think the atom_seq method 
should be attached to the Chain object, not the Entry object, so that 
it is called as $chain->atom_seq rather than $entry->atom_seq('A'). The 
sequence is a property of a chain, so this makes the most sense to me 
personally.
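
As a rough illustration of that design (again plain Python, with 
made-up Chain and Entry classes that are not the Bio::Structure ones, 
and a deliberately tiny residue table), the sequence method lives on 
the chain and the entry only hands out chains:

```python
from dataclasses import dataclass, field

# Subset of the three-letter to one-letter table, for brevity only.
ONE_LETTER = {"GLY": "G", "GLU": "E", "CYS": "C", "LYS": "K"}


@dataclass
class Chain:
    """Hypothetical chain: an id plus residue names in order."""
    chain_id: str
    residues: list = field(default_factory=list)

    def atom_seq(self):
        # The sequence belongs to the chain, so no chain id argument
        # is needed. Gap padding is left out of this sketch.
        return "".join(ONE_LETTER.get(name, "X") for name in self.residues)


@dataclass
class Entry:
    """Hypothetical entry: just a lookup from chain id to Chain."""
    chains: dict = field(default_factory=dict)

    def chain(self, chain_id):
        return self.chains[chain_id]


entry = Entry({
    "A": Chain("A", ["GLY", "GLU"]),
    "B": Chain("B", ["CYS", "LYS"]),
})
print(entry.chain("A").atom_seq())  # GE
```

An entry-level convenience method could still be added later as a thin 
wrapper that looks up the chain and calls its atom_seq.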

Alex Gutteridge
European Bioinformatics Institute
Cambridge CB10 1SD
UK

Tel: 01223 492550
Email: alexg at ebi.ac.uk