[Biopython] (Bio.PDB) problems with NeighborSearch: error at levels above "A", residue index discrepancy with unfold_entities

Thu Oct 17 22:05:48 UTC 2013

Hi, João,

A late reply is much better than no reply. I'm impressed you tracked 
this down and followed up, and I appreciate your help. And it took me a 
while to get around to revisiting this myself.

I was using Python 3.2 when I got the "unorderable types" error. For 
unrelated reasons, I ended up switching to Python 2.7.3, and now doing 
search_all at the residue level works.

The issue with the indexing is not that the residues' get_id() function 
returns a different number from the corresponding list index in the list 
returned by Selection.unfold_entities(). That is inconvenient, but I've 
been working around it. What puzzled me is that it appeared that the 
NeighborSearch was accessing residues that unfold_entities() wasn't 
accessing, although this wouldn't make sense because I used 
NeighborSearch on the results of a call to unfold_entities(). Let me 
check again; it could have been something I was doing wrong.

What do the chain breaks mean? Are they missing data, and if so, what is 
missing? And what are their consequences for working with the data? How 
would they be problematic for iterating over residues, calculating 
distances, returning the amino acid sequence of the structure, etc?

Thanks again,

James

On 10/08/2013 02:22 PM, João Rodrigues wrote:
> Dear James,
>
> Regarding problem 1. What you describe runs fine on my machine, using 
> Python 2.7.5 and an up-to-date Biopython git version. You logic seems 
> fine, maybe it's the version of Python you are using?
>
> Regarding your second problem, that of the mismatched indexes. The 
> Selection method returns a *list* of residues while when you iterate 
> over the neighbors and ask for their id it gives back the id of the 
> residue. This id will only correspond to the Selection list index if 
> your residues are numbered from 1 to N without gaps. If your protein 
> starts at residue 3, then the first item given back by Selection has 
> index 0 while in fact the id is 3. Does this make sense?
>
> The warning occurs if you have chain breaks. There should be some gaps 
> in your structure, starting at a number other than 1 does not raise 
> this warning normally.
>
> Cheers and sorry for the late reply,
>
> João
>
>
>
> 2013/8/30 James Jensen <jdjensen at eng.ucsd.edu 
> <mailto:jdjensen at eng.ucsd.edu>>
>
>     Hello!
>
>     I am writing a function that, given two chains in a PDB file,
>     should return 1) the positions and identities of all residues that
>     are in contact with (distance < 5 angstroms) a residue on the
>     other chain, and 2) the amino acid sequences of the chains. I've
>     been doing this with NeighborSearch.search_all(radius=5,
>     level='A') and then for each atom pair, seeing what its parent
>     residue is and whether the parent residues of the two atoms belong
>     to different chains. This may seem like a roundabout way of doing
>     it, but if I call search_all(radius=5, level='R'), or indeed with
>     level=any level other than 'A', I get the error
>
>             TypeError: unorderable types: Residue() < Residue()
>
>     So my first question is why it might be that search_all isn't
>     working at higher levels.
>
>     For the adjacent residue pairs I identify using NeighborSearch, I
>     get each residue's position in its respective chain by
>     residue.get_id()[1].
>
>     I've noticed, however, that if I get the sequence of the chain
>     using seq = Selection.unfold_entities(chain, 'R') and then
>     reference (i.e. seq[index]) the amino acids using the indices
>     returned by the NeighborSearch step, they are not the same
>     residues that I get if during the NeighborSearch step I report
>     residue.get_resname() for each adjacent residue.
>
>     I've tried it with several proteins, and the problem is the same.
>     Chains A and C of 2h62 are an example.
>
>     I then noticed that the lowest residue ID number of the residues
>     yielded from Selection.unfold_entities(chain, 'R') is not 1. For
>     chain A, it's 11, and for chain C, it's 34. Not knowing why this
>     was, I thought I'd try subtracting the lowest ID number from the
>     indices returned by the NeighborSearch step (i.e. in chain A, 11
>     -> 0 so seq[0] would be the first residue, the one with ID 11).
>     This happened to seem to work for chain A. However, it gives me
>     negative indices for some of the contacts in chain C. This means
>     that NeighborSearch can return residues that are not returned by
>     unfold_entities(). The lowest residue ID returned by
>     NeighborSearch for chain C was 24, whereas for unfold_entities()
>     it was 34.
>
>     For both chains A and C, I was given the warning
>
>             PDBConstructionWarning: WARNING: Chain [letter] is
>     discontinuous at line [line number].
>
>     In fact, I seem to get this warning for just about every chain of
>     every structure I load. Is this the reason that the first residues
>     in the two chains are at 11 and 34, rather than 1? If so, could it
>     be that NeighborSearch is able to work around the discontinuity
>     while unfold_entities is not?
>
>     Any suggestions?
>
>     Thanks for your time and help,
>
>     James Jensen
>     _______________________________________________
>     Biopython mailing list  - Biopython at lists.open-bio.org
>     <mailto:Biopython at lists.open-bio.org>
>     http://lists.open-bio.org/mailman/listinfo/biopython
>
>