[Biopython-dev] PIR parsing

Wed Dec 6 03:16:16 EST 2000

I've written a much more complete PIR CODATA parser which works with
the latest PIR release (Release 66.00, September 30, 2000).  I tested
it against pir1.dat and pir3.dat.

The PIR format is somewhat nasty, but not as bad as I thought it would
be.  It's like several other formats in that long fields fold over to
the next (indented) lines.  The only major problem was that the folded
lines themselves can contain multiple elements, like

FEATURE
   2-105               #product cytochrome c #status experimental #label
                       MAT\

 or in XML with some extra newlines ... :)

   <feature_range><begin_pos>2</begin_pos>-<end_pos>105</end_pos>
</feature_range>               #product <product>cytochrome c
</product> #status <status>experimental</status> #label
                       <label>MAT\</label>

Some of the fields don't have any #elements at all.  Even so, the
implementation is pretty strict: it checks that no word inside a
text field starts with a '#'.  That check makes the pattern quite
gnarly, but it is needed to ensure I'm not missing an element by
accident.
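
To make the idea concrete, here is a minimal sketch of that kind of
strict pattern.  It is not the code in PIR_3_0.py.  The names unfold,
parse_feature, TEXT, LEADING and ELEMENT are made up for this example,
and it assumes that folded lines can be rejoined with a single space.

```python
import re

# Hypothetical names throughout; this is an illustration, not PIR_3_0.py.
# A run of whitespace-separated words, none of which starts with '#'.
TEXT = r"((?:(?!#)\S+\s*)*)"
LEADING = re.compile(TEXT)
# One "#name value words..." element; the value stops at the next '#word'.
ELEMENT = re.compile(r"#(\w+)\s*" + TEXT)


def unfold(lines):
    """Join a field's first line with its indented continuation lines.

    Assumption: a single space is the right separator when unfolding.
    """
    return " ".join(line.strip() for line in lines)


def parse_feature(lines):
    """Return (location, free_text, [(element_name, value), ...])."""
    location, _, rest = unfold(lines).partition(" ")
    rest = rest.strip()

    # Free text before the first element (may be empty).
    m = LEADING.match(rest)
    free_text = m.group(1).strip()
    pos = m.end()

    elements = []
    while pos < len(rest):
        m = ELEMENT.match(rest, pos)
        if m is None:
            # Only reachable when a word starts with '#' but is not a
            # well-formed element name, so nothing is silently skipped.
            raise ValueError("stray '#' at offset %d in %r" % (pos, rest))
        elements.append((m.group(1), m.group(2).strip()))
        pos = m.end()
    return location, free_text, elements


example = [
    "   2-105               #product cytochrome c #status experimental #label",
    "                       MAT\\",
]
location, free_text, elements = parse_feature(example)
```

The negative lookahead (?!#) is what keeps a '#'-prefixed word from
being swallowed into the preceding text, at the cost of a less
readable pattern.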

The new module is (temporarily) at
    http://www.biopython.org/~dalke/PIR_3_0.py  .
It should work fine, but it hasn't been tested against a real need
(like generating HTML or data structures), so it will likely change
as those needs are resolved.  Also, the indentation level has
changed since release 65, so it probably won't work with anything
other than the most recent version.

Some things to do for the future:
  o rewrite to clean things up, now that the format is known (some
       of the definitions are scaffolding to explore the format)
  o choose better names
  o parse more of the format
      - identify parts of the journal references
      - make each component accessible in a semicolon-delimited list

BTW, the callback overhead for this format is about four times the
cost of the parsing itself.  The PIR format intermingles sequence
letters and markup about the residue: one letter of one, then one
letter of the other.  So every sequence character creates three
function calls!  (begin, character and end.)
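
A rough sketch of where those calls come from, assuming a SAX-style
handler.  The class CountingHandler, the function emit_sequence, the
"residue" tag name and the sample string are all invented for this
illustration, and the sample is not a real PIR sequence line.

```python
class CountingHandler:
    """Hypothetical SAX-like handler that just counts its callbacks."""

    def __init__(self):
        self.calls = 0
        self.residues = []

    def startElement(self, name):
        self.calls += 1

    def characters(self, text):
        self.calls += 1
        self.residues.append(text)

    def endElement(self, name):
        self.calls += 1


def emit_sequence(interleaved, handler):
    """Fire begin/characters/end once per residue letter.

    Assumption: residues sit at even offsets and the one-character
    markup for each residue sits at the odd offsets in between.
    """
    for residue in interleaved[0::2]:
        handler.startElement("residue")
        handler.characters(residue)
        handler.endElement("residue")


handler = CountingHandler()
emit_sequence("G D V E K", handler)
sequence = "".join(handler.residues)
```

With five residues the handler is called fifteen times, so the
per-character Python call overhead grows with sequence length while
the pattern matching underneath stays comparatively cheap.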

                    Andrew
                    dalke at acm.org