[Biopython-dev] pull request: Handle MMCIF with multiple models (closes 2943)

Tue Apr 24 18:20:16 UTC 2012

On Tue, Apr 24, 2012 at 1:56 PM, Lenna Peterson <arklenna at gmail.com> wrote:
> On Tue, Apr 24, 2012 at 11:38 AM, Eric Talevich <eric.talevich at gmail.com> wrote:
>>
>> On Tue, Apr 24, 2012 at 12:25 AM, Lenna Peterson <arklenna at gmail.com> wrote:
>> > On Mon, Apr 23, 2012 at 4:10 PM, Eric Talevich <eric.talevich at gmail.com> wrote:
>> >>
>> >> Ack, I didn't look at that closely enough. Check out this patch to see
>> >> the current situation:
>> >> https://github.com/biopython/biopython/commit/abdab1a1132ec811f9636f8ba805bbb6cda6dbe9
>> >>
>> >> The models associated with a structure are numbered with a sequential
>> >> integer id, starting from 0. It's always been like that in our PDB
>> >> parser and we haven't changed it. To ensure that model numbers
>> >> specified in the PDB file are preserved when writing the PDB back to
>> >> file, the above patch introduced a new attribute on the Model object
>> >> called serial_num (also an integer, equal to model.id unless specified
>> >> otherwise). That attribute is only used when writing a new PDB file;
>> >> Model.__getitem__ still uses Model.id as before.
>> >>
>> >> Perhaps that's surprising now that we read the serial numbers, but it
>> >> kept backward compatibility. Plus, it preserves list-like behavior
>> >> (item access via integers), even though the models are actually stored
>> >> in a dict.
>> >>
>> >> So!
>> >>
>> >> In the mmCIF parser, the calls to structure_builder.init_model should
>> >> be given two arguments instead of one: an integer id counting from 0,
>> >> and then another integer (probably) containing the model "serial
>> >> number" specified in the mmCIF file. In the event that an mmCIF file
>> >> doesn't specify the model number, the serial number should be the same
>> >> as the sequential id.
>> >>
>> >> Cool? This will also help us convert between PDB and mmCIF formats in
>> >> the future.
>> >>
>> >> As for accessing the models by their serial number, using string keys
>> >> seems like an effective workaround, but still obviously a workaround
>> >> rather than an ideal situation. Let's discuss that a little more,
>> >> perhaps file another bug when we've reached some consensus.
>> >>
>> >> Best,
>> >> Eric
>> >
>> >
>> > Hi Eric,
>> >
>> > I believe I've implemented the model_id/serial_id system found in PDB:
>> >
>> > https://github.com/lennax/biopython/commit/b453a2968d18e157aac1f99f9f3cfeb4c09bc77d
>> >
>> > Please let me know if you think that looks right. I couldn't find an
>> > mmCIF file without a model column to test, but I believe in that case
>> > it will assign model_id and serial_id to 0. Would that be the correct
>> > behavior?
>> >
>> > I also modified the unit test to check the model serial_num.
>> > https://github.com/lennax/biopython/commit/b0443e788438b8ff72979c7a3bc0e531d4cd5cf6
>> >
>> > Currently serial_num is int() of the CIF model column. Regarding
>> > access by string serial_num, I am concerned that the int/string access
>> > would be too subtle (structure[0] == structure['1']; structure[1] ==
>> > structure['2']?). Perhaps an accessor function? i.e.
>> > structure.get_model('1')
>> >
>> > Let me know if you think I should write get_model() or something along
>> > those lines.
>> >
>> > Lenna
>>
>> I left another nitpick on b453a, but besides that it looks exactly right to me.
>>
>> The string/int distinction would indeed be weird, especially for newer
>> Python users coming from Perl or JavaScript. I don't see a direct
>> analogue for get_model(serial_num) in the other Entities (Residue,
>> Chain, Model, Structure), so I'm inclined to put off the decision for
>> now (i.e. leave it out of this patch set).
>>
>> -Eric
>
>
> Eric,
>
> Okay, I've changed the generic warning for a bad model number to a
> PDBConstructionException.
>
> New pull request to get MMCIF to the same state as PDB:
> https://github.com/biopython/biopython/pull/36
>
> So are chains accessed by 0, 1, 2 or by A, B, C?
>
> Lenna

Cool, I just merged the pull request. Thanks!
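
For the archive, here is a rough sketch of the numbering scheme discussed
above: models get a sequential id counting from 0, and the model number read
from the file is kept separately as serial_num, falling back to the id when
the file gives none. The class and the build_models helper are toy stand-ins
invented for illustration, not the Bio.PDB code, and the model numbers are
made up.

```python
class Model:
    """Toy stand-in for a model record; not Bio.PDB's Model class."""

    def __init__(self, id, serial_num=None):
        self.id = id
        # Fall back to the sequential id when the file gives no model number.
        self.serial_num = id if serial_num is None else serial_num


def build_models(file_model_numbers):
    """Hypothetical helper: assign 0-based ids in file order.

    The number found in the file (or None) is kept as serial_num.
    """
    models = {}
    for model_id, number in enumerate(file_model_numbers):
        models[model_id] = Model(model_id, number)
    return models


# Made-up model numbers, as they might appear in a file.
models = build_models([1, 2, 5])
print(models[0].serial_num)  # 1
print(models[2].serial_num)  # 5

# A file with no model number: serial_num equals the sequential id.
unnumbered = build_models([None])
print(unnumbered[0].serial_num)  # 0
```

Item access stays keyed on the 0-based id, so existing code that does
structure[0] keeps working; serial_num only matters when writing back out.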

As João said, chains are accessed by their letter ID via __getitem__
(implemented in Bio.PDB.Entity). You can also get at them by position
through child_list, or by ID through child_dict. Kind of a
thrill. I suppose we could eventually refactor the Entity-based
classes to use a single data structure (OrderedDict, namedtuple, numpy
array with named columns/rows?) in place of child_dict and child_list,
and clean up some of the redundant accessors.
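
A minimal sketch of what that refactor might look like, assuming an
OrderedDict is the single container. Both class names are hypothetical and
the "chains" are plain strings; the first class only imitates the current
list-plus-dict layout and is not the real Bio.PDB.Entity.

```python
from collections import OrderedDict


class TwoStructureEntity:
    """Toy version of the current layout: a list for order, a dict for IDs."""

    def __init__(self):
        self.child_list = []
        self.child_dict = {}

    def add(self, child_id, child):
        # Both containers must be kept in sync by hand.
        self.child_list.append(child)
        self.child_dict[child_id] = child

    def __getitem__(self, child_id):
        return self.child_dict[child_id]


class OrderedEntity:
    """Hypothetical refactor: one OrderedDict serves both purposes."""

    def __init__(self):
        self._children = OrderedDict()

    def add(self, child_id, child):
        self._children[child_id] = child

    def __getitem__(self, child_id):
        return self._children[child_id]

    @property
    def child_list(self):
        # Insertion order doubles as positional order.
        return list(self._children.values())


old_style = TwoStructureEntity()
new_style = OrderedEntity()
for chain_id in "ABC":
    old_style.add(chain_id, "chain " + chain_id)
    new_style.add(chain_id, "chain " + chain_id)

print(new_style["B"])           # chain B
print(new_style.child_list[1])  # chain B
```

Exposing child_list as a property would keep positional access available to
existing callers while removing the second container.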

-E