[BioPython] Comment/Suggestion about Bio.PDB.Polypeptide class. How to keep gaps information ?

Tue May 24 14:26:32 EDT 2005

Hi Julie,

> i.e.: We totally lose the information of gaps. "pp1" still contains this
> information but cannot give it to "seq" even if using the gapped
> alphabet.
> I know it would be possible to get it from an iteration on residue from
> the structure. However, I think it would be better to fill gap with an
> 'X' or a '-' while doing pp1.get_sequence(). I mean changing the method
> get_sequence to handle this case.

I'll start by pointing out that you cannot rely on the resseq numbering
being meaningful AT ALL. There are plenty of structures in the PDB where
residue X is firmly attached to residue X+Y (with Y>1), and structures
where X is not attached to X+1. That's the reason why Bio.PDB uses a
distance criterion to find polypeptides.
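
To make the point concrete, here is a rough sketch of a distance criterion
in plain Python. It is not the Bio.PDB code: the function name, the toy
coordinates and the 1.8 Angstrom C-N cutoff are all assumptions chosen for
illustration. Note how residue 15 ends up joined to 11, while 16 starts a
new fragment even though its number directly follows 15.

```python
import math


def split_by_distance(residues, cutoff=1.8):
    """Group residues into fragments using the C(i)-N(i+1) distance.

    residues: list of (resseq, n_coord, c_coord) tuples in chain order.
    cutoff:   assumed maximum peptide-bond length in Angstrom.
    Returns a list of fragments, each a list of resseq numbers.
    """
    fragments = []
    current = []
    prev_c = None
    for resseq, n_coord, c_coord in residues:
        # Start a new fragment when the previous C is too far from this N.
        if prev_c is not None and math.dist(prev_c, n_coord) > cutoff:
            fragments.append(current)
            current = []
        current.append(resseq)
        prev_c = c_coord
    if current:
        fragments.append(current)
    return fragments


# Toy data along the x axis: 11 -> 15 is bonded despite the numbering jump,
# 15 -> 16 is broken despite consecutive numbering.
toy = [
    (10, (0.0, 0.0, 0.0), (2.4, 0.0, 0.0)),
    (11, (3.73, 0.0, 0.0), (6.13, 0.0, 0.0)),
    (15, (7.46, 0.0, 0.0), (9.86, 0.0, 0.0)),
    (16, (20.0, 0.0, 0.0), (22.4, 0.0, 0.0)),
]
print(split_by_distance(toy))
```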

OTOH it would certainly be useful to have gap information, but I'd like to
put that in a separate class, i.e. BrokenPolypeptide. PolypeptideBuilder
could have a method build_broken_peptide that would return a
BrokenPolypeptide object. That class could have fancy methods to deal with
gaps and the sequences of the missing parts, for example.
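
Something along these lines, as a very rough sketch. Nothing here exists
in Bio.PDB yet: the class body, the method names get_sequence and
gap_count, and the choice of plain strings for the fragments are all
hypothetical. It puts one gap character per break and deliberately does
not guess the gap length from the resseq numbers, for the reason given
above.

```python
class BrokenPolypeptide:
    """Hypothetical container for the contiguous fragments of one chain."""

    def __init__(self, fragments):
        # Each fragment is the one-letter sequence of one contiguous piece,
        # in chain order (plain strings here, to keep the sketch standalone).
        self.fragments = list(fragments)

    def get_sequence(self, gap_char="-"):
        """Return the fragment sequences joined by one gap marker per break.

        The true number of missing residues is not inferred.
        """
        return gap_char.join(self.fragments)

    def gap_count(self):
        """Return the number of breaks between fragments."""
        return max(len(self.fragments) - 1, 0)


bp = BrokenPolypeptide(["MKT", "LLV", "G"])
print(bp.get_sequence())
print(bp.get_sequence(gap_char="X"))
```

Filling in the sequences of the missing parts would need an extra source,
such as the SEQRES records, which this sketch does not touch.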

I'll try to add this, but I'm busy at the moment (4 articles in the
pipeline). You're welcome to give it a try and send me your code :-).

Best regards,

-Thomas