[Bioperl-l] PDB sequence from ATOM records
Alex Gutteridge
alexg at ebi.ac.uk
Mon Jul 5 06:35:08 EDT 2004
> Hi,
>
> I need to be able to map some protein sequence alignment information
> onto a protein structure. To do this I need to get the sequence from
> the ATOM records, since the SEQRES sequence is often not exactly the
> same.
>
> So I'd like to change Bio::Structure::Residue a little:
>
> Amino-acid type and residue number are currently contained in one
> value, residue->id. I would like to separate them into two,
> residue->type and residue->num. Then, for backwards compatibility,
> construct residue->id from these each time it is required (or store it
> as well if that is better?). residue->type should be able to return the
> one-letter code as well as the three-letter code.
>
> And then have a method called something like
> Bio::Structure::Entry->atom_seq that would return a Bio::PrimarySeq
> object constructed from the one-letter codes of the residues of a
> particular chain.
>
> Any comments please... Thanks.
>
> Matthew
>
> P.S. Sorry if this message is a repeat - our email server went down as
> I was sending it the first time.
>
> --
> Matthew Betts, mailto:matthew.betts at ii.uib.no
> Phone: (+47) 55 58 40 22, Fax: (+47) 55 58 42 95
> CBU, BCCS, HiB, UNIFOB / Universitetet i Bergen
> Thormøhlensgt. 55, N-5008 Bergen, Norway
>
Hi,
I added something similar to my local copy of the BioPerl structure
modules a while ago (I'm not sure who, if anyone, is currently
maintaining Bio::Structure::*). I can't find my implementation, since I
don't really use Bio::Structure::* anymore, but if it turns up I'll
forward it to you. It's definitely a useful method to have: as you say,
SEQRES can be quite different from what you actually see in the ATOM
records. It's simple enough to implement, but keep in mind the
following gotchas:

Gaps in the crystal structure: You'll have to pad with 'X's where the
ATOM records are missing. E.g. if you have residues GLY40, GLU41, CYS44,
LYS45, you need to make the sequence GEXXCK.

Insertion codes: PDB files can have residue 'sequences' like
40, 41A, 41B, 42. You need to be careful not to look at the number
alone, otherwise you might miss a residue.

Strange starting points: There are PDB files whose residue numbering
starts in the negative, i.e. -5, -4, -3, -2, -1, 0, 1. Make sure your
code can deal with this.

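The three gotchas above can be sketched roughly as follows. This is a
language-neutral illustration in Python, not BioPerl code: the names
split_resid and atom_seq, the (name, number, insertion code) tuple and
the choice of 'X' for unknown residue names are all assumptions made for
the example. It also assumes the residues arrive in ATOM-record order
with one entry per residue.

```python
import re
from typing import List, Tuple

# Three-letter to one-letter codes for the 20 standard amino acids.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}

# Hypothetical residue representation: (three-letter name, number, insertion code).
Residue = Tuple[str, int, str]


def split_resid(resid: str) -> Tuple[int, str]:
    """Split a residue label such as '41A' or '-5' into (number, insertion code)."""
    match = re.fullmatch(r"(-?\d+)([A-Za-z]?)", resid.strip())
    if match is None:
        raise ValueError(f"unrecognised residue label: {resid!r}")
    return int(match.group(1)), match.group(2)


def atom_seq(residues: List[Residue], gap_char: str = "X") -> str:
    """Build a one-letter sequence from residues given in ATOM-record order."""
    seq = []
    prev_num = None
    for name, num, _icode in residues:
        # Pad only when the numbering jumps by more than one. Residues that
        # share a number (41A, 41B) give a difference of 0, so nothing is
        # padded and neither residue is dropped.
        if prev_num is not None and num - prev_num > 1:
            seq.append(gap_char * (num - prev_num - 1))
        # Non-standard residue names fall back to 'X' (an assumption).
        seq.append(THREE_TO_ONE.get(name, "X"))
        prev_num = num
    return "".join(seq)
```

Because the padding works on the difference between consecutive
numbers, numbering that starts below zero needs no special handling.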
The only other comment I would make is that I think the atom_seq method
should be attached to the Chain object, not the Entry object, and so be
called as $chain->atom_seq rather than $entry->atom_seq('A'). The
sequence is a property of a chain, so this makes the most sense to me.
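
To illustrate the shape of that design, here is a minimal sketch, again
in Python rather than Perl. The Residue and Chain classes below are
hypothetical stand-ins, not the Bio::Structure classes, and the
'TYPE-NUM' format used for the derived id is an assumption. Gap padding
is left out to keep the example short.

```python
from dataclasses import dataclass, field
from typing import List

# Truncated lookup table, just enough for the example.
ONE_LETTER = {"GLY": "G", "GLU": "E", "CYS": "C", "LYS": "K"}


@dataclass
class Residue:
    type: str  # three-letter amino-acid code
    num: int   # residue number

    @property
    def id(self) -> str:
        # Derived on demand from type and num, so the combined
        # identifier stays available for backwards compatibility.
        return f"{self.type}-{self.num}"


@dataclass
class Chain:
    id: str
    residues: List[Residue] = field(default_factory=list)

    def atom_seq(self) -> str:
        # The sequence belongs to the chain, so no chain id argument is needed.
        return "".join(ONE_LETTER.get(r.type, "X") for r in self.residues)


chain = Chain("A", [Residue("GLY", 40), Residue("GLU", 41)])
print(chain.atom_seq())
print(chain.residues[0].id)
```
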
Alex Gutteridge
European Bioinformatics Institute
Cambridge CB10 1SD
UK
Tel: 01223 492550
Email: alexg at ebi.ac.uk