[Biopython-dev] Moratorium on commits?

Lenna Peterson arklenna at gmail.com
Mon Aug 12 19:25:20 UTC 2013

On Mon, Aug 12, 2013 at 2:33 PM, Morten Kjeldgaard <mok at bioxray.dk> wrote:

> On 11/08/2013, at 22:50, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> > I still feel this was a minor change (although of
> > course important to some, including you). This is
> > parsing of malformed PDF files where the user
> > ALREADY gets a warning (or error in strict mode,
> > where there would be no functional change) that
> > there is a problem with the occupancy data.
> >
> > One reason why I specifically talked about small
> > commits (in the sense of a simple diff) above is
> > they are trivial to revert if the need arises, or as
> > in this case, modify:
> >
> https://github.com/biopython/biopython/commit/500c3c2ea900fd8c8f5123f571d4d9a244ee898e
> >
> > This change was suggested and supported by
> > people who've been actively contributing to the
> > Biopython structural module for some time, so I
> > had reason to trust their good judgement, and as
> > I wrote at the time there was a clear consensus
> > with three people in all happy with the idea:
> >
> http://lists.open-bio.org/pipermail/biopython-dev/2013-August/010773.html
> I respect that you listen more to developers that have been contributing
> for a long time. That is quite understandable, but I hope that does not
> prevent me from contributing my opinions.
> What prompted my response was the suggestion that the occupancy should be
> set to 1.0 if it is abscent from the file, i.e. if the PDB file is
> malformed. I think that is an incorrect behavior, and I say that not as a
> core developer, but as a crystallographer. If invalid data is present in
> the file, you do not want the toolkit transforming it to valid data.

 I appreciate the physical/practical feedback about the commits.

After thinking about it, the suggestion to set values to None when they are
> not defined in a malformed file now appears quite reasonable, but if it is
> done this way with occupancies, it should also done this way with
> B-factors, chain identifiers and other values that are mandatory in the
> file according to the format specs. From the users perspective, if the
> values returned are None, you are alerted to the fact that something is
> wrong, and you should make an appropriate choice, whatever that may be.
I agree that `None` is a good warning value for missing data.

I just skimmed the code and summarized how some of the missing values are

* Serial number: 0
* Chain: fatal in both strict and permissive modes (i.e. no try/except)
* Coordinates: fatal in both strict and permissive modes
* Occupancy: we recently decided to set as None in permissive
* B-factor: 0.0 in permissive (code comment states this is PDB default)
* Model seq id: 0

The StructureBuilder class also has certain ways of handling duplicate
residues and atoms that I'm not particularly familiar with. For example,
I'm not quite sure what will happen if successive atoms have missing serial

PDB is a format where there's always a balance between absolute adherence
to the format and enough flexibility to deal with the wide range of
malformed files.


More information about the Biopython-dev mailing list