[Biopython] Reading large files, Biopython cookbook example

Peter Cock p.j.a.cock at googlemail.com
Tue Aug 6 09:35:25 UTC 2013


On Tue, Aug 6, 2013 at 12:09 AM, Andrew Dalke <dalke at dalkescientific.com> wrote:
> A bit late, but a bit of background:
>
>> On Sun, Jul 14, 2013 at 5:40 PM, Katrina Lexa <klexa at umich.edu> wrote:
>>> My PDB file came from Maestro, so that is the ordering it follows after 9999.
>
> On Jul 15, 2013, at 7:46 PM, Peter Cock wrote:
>> i.e. This software package? http://www.schrodinger.com/productpage/14/12/
>>
>> Could you contact their support to find out why they are doing this please?
>
> Yes, that's the Maestro Katrina was almost certainly talking about. It's a
> commercial package which has been around for a while; the company
> started in 1990 as a commercialization of the Jaguar QM package from
> Richard Friesner's and William Goddard's labs at Caltech. Maestro is
> the GUI to their QM and MM codes.
>
> Their conversion routines support various options. See:
>   https://www.schrodinger.com//AcrobatFile.php?type=supportdocs&type2=&ident=530
>
> The key ones are:
>
>   -hex : Use hexadecimal encoding for atom numbers greater
>     than 99999 and for residue numbers greater than 9999
>
> and
>
>   -hybrid36 : Use the hybrid36 scheme for atom serial numbers.
>     On input, integers of up to 6 digits and hexadecimal numbers are
>     recognized on ATOM records by default. On output, the default is
>     to use integers for less than 100 000 atoms, and hexadecimal for
>     100 000 atoms or more
>
>
> Annoyingly, as Robert Hanson reported in:
>   http://www.mailinglistarchive.com/html/jmol-users@lists.sourceforge.net/2013-01/msg00111.html
> (and see the thread at)
>   http://article.gmane.org/gmane.science.chemistry.blue-obelisk/1659/match=pdb+ok+who%27s+wise+guy
>
> their default output generates records like:
>
> ATOM  99998  H1  TIP3W3304     -28.543  60.673  40.064  1.00  0.00      WT5  H
> ATOM  99999  H2  TIP3W3304     -27.773  60.376  41.353  1.00  0.00      WT5  H
> ATOM  186a0  OH2 TIP3W3305     -24.713  61.533  47.372  1.00  0.00      WT5  O
> ATOM  186a1  H1  TIP3W3305     -25.652  61.772  47.519  1.00  0.00      WT5  H
> ATOM  186a2  H2  TIP3W3305     -24.713  61.625  46.379  1.00  0.00      WT5  H
>
> which means there can be two atoms with the serial number "18700" (or
> "99999", etc.) in the same file, one to be read as decimal and the
> other as hexadecimal, so the same digits refer to different atoms.
>
> This obviously messes up all of the other PDB annotations which use
> a serial id, but I presume that most Maestro users only use PDB files
> for coordinate data, and not for the other fields.
>
> Maestro is the only program I know of which uses this awful form.
> Enabling the "-hybrid36" option (first-digit-is-in-base-36) by default
> would make it more consistent with what tools in the X-PLOR/VMD
> heritage do, where A0000 follows 99999. Presumably they want
> the full 1,048,575 atom range.
>
>
>> If there are guidelines in the PDB specification for when this field overflows
>> I missed them, but it is a problem if there are rival hacks in common use
>> (roll-over/wrap-around versus this semi-hex scheme).
>
> There are no specs for how to handle more than 9999 residues,
> just like there are no specs for how to handle more than 99999 atoms.
>
> Cheers,
>
>
>                                 Andrew
>                                 dalke at dalkescientific.com

Thanks Andrew - useful background.

In the long run this problem should go away as the PDB moves
to using the PDBx/mmCIF format:
http://www.wwpdb.org/news/news_2013.html#22-May-2013
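
Until then, files like the one Andrew quotes need a workaround. As a
rough sketch only: the function names below are hypothetical, and the
code assumes the serial field sits in columns 7-11 of each ATOM line and
that serials increase through the file, running in decimal up to 99999
and in hexadecimal from 186a0 (= 100000) onwards. Under those
assumptions the ambiguity of a field like "18700" can be resolved from
its position in the file:

```python
def serial_field(line):
    """Return the raw 5-character serial field (columns 7-11) of an ATOM line."""
    return line[6:11]


def decode_serials(fields):
    """Decode serial fields given in file order.

    Assumption: serials are decimal up to 99999 and hexadecimal
    afterwards, as in the Maestro output quoted above.
    """
    serials = []
    use_hex = False
    for field in fields:
        text = field.strip()
        if not use_hex:
            # Switch to hex at the first field that is not plain decimal,
            # or as soon as the previous serial has reached 99999.
            if not text.isdigit() or (serials and serials[-1] >= 99999):
                use_hex = True
        serials.append(int(text, 16 if use_hex else 10))
    return serials
```

With this, "18700" decodes to 18700 before the switch and to 100096
after it. A file that wraps around or uses hybrid36 instead would need
different handling.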

Peter


