[Bioperl-l] GO parsers & event driven parsing framework

Chris Mungall cjm@fruitfly.org
Wed, 15 May 2002 12:00:55 -0700 (PDT)


I'm rewriting the GO-text parsers (currently part of the GO software
toolkit on sourceforge) and will probably commit these to bioperl.

I'm using a lightweight event driven framework, and obviously it would
make sense to use the same framework for any rewrite of the current SeqIO
parsers. I'll outline my method, and will happily change it to fit into
another framework if anyone has any suggestions.

Initially I'd like to check in just the event-driven parsing part, and
think about graph/ontology object models later.

The GO-text parser (and any other parser following the framework) generates
nested events. I'm using the term "event" to be SAX friendly, but really
these are just trees.

Let's take an imaginary GO graph; GO-style structured controlled
vocabularies (see ftp://ftp.geneontology.org/pub/go/ontology) are often
stored in the gotext flatfile format. This is a somewhat ad-hoc format
that uses indentation to represent graphs as trees; multiple parentage is
either represented as duplicate subtrees or in the detail line.

$Gene_ontology ; GO:0000001
 %function ; GO:0000002
  %enzyme ; GO:0000003

Other more robust formats are possible (n3, rdf, oil, daml, etc) but
gotext is already prevalent and it is important to be able to parse it.
You get used to it after a while - the above means "enzyme" is a subtype
of "function", which is in turn a subtype of the general "Gene_ontology"
type.

The GoText parser would zip through this, firing nested events in the
following structure:

[go
  [term
    [name 'function']
    [acc  'GO:0000002']
  ]
  [term
    [name 'enzyme']
    [acc  'GO:0000003']
    [relationship
      [type isa]
      [obj 'GO:0000002']
    ]
  ]
]

e.g.  the parsing code starts off by calling

$self->start_event("go")

when a new term is encountered, it fires this

$self->event("name", $name)

at the end it says

$self->end_event("go")

the event handler can be overridden to intercept any of these; the default
handler will just catch and nest all the events and return a tree as
above.
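
A minimal sketch of how a parser and a default tree-building handler could
fit together. The names SketchTreeHandler and parse_gotext are made up for
illustration and are not existing bioperl classes. The parser assumes one
space of indentation per level and handles only the '$' (root) and '%'
(is-a) prefixes from the example above; real gotext has more syntax than
that.

```perl
use strict;
use warnings;

{
    package SketchTreeHandler;    # hypothetical name, not a bioperl class

    sub new {
        my ($class) = @_;
        return bless { stack => [ [] ] }, $class;   # dummy root on the stack
    }

    # open a node and make it the current container
    sub start_event {
        my ($self, $name) = @_;
        push @{ $self->{stack} }, [$name];
    }

    # leaf event: append [name, value] to the current node
    sub event {
        my ($self, $name, $value) = @_;
        push @{ $self->{stack}[-1] }, [$name, $value];
    }

    # close the current node and attach it to its parent
    sub end_event {
        my ($self, $name) = @_;
        my $node = pop @{ $self->{stack} };
        die "unbalanced end_event($name)\n" unless $node->[0] eq $name;
        push @{ $self->{stack}[-1] }, $node;
    }

    # the finished tree is the only child of the dummy root
    sub tree { return $_[0]{stack}[0][0] }
}

# Assumption: one space of indentation per level, '$' and '%' prefixes only.
sub parse_gotext {
    my ($handler, @lines) = @_;
    my @acc_at_depth;    # accession of the most recent term at each depth
    $handler->start_event('go');
    for my $line (@lines) {
        next unless $line =~ /^( *)([\$%])\s*(.+?)\s*;\s*(GO:\d+)\s*$/;
        my ($depth, $prefix, $name, $acc) = (length($1), $2, $3, $4);
        $acc_at_depth[$depth] = $acc;
        next if $prefix eq '$';           # root line: no term event, as above
        $handler->start_event('term');
        $handler->event(name => $name);
        $handler->event(acc  => $acc);
        if ($depth > 1) {                 # parent is a real term, not the root
            $handler->start_event('relationship');
            $handler->event(type => 'isa');
            $handler->event(obj  => $acc_at_depth[$depth - 1]);
            $handler->end_event('relationship');
        }
        $handler->end_event('term');
    }
    $handler->end_event('go');
    return $handler;
}

my $handler = parse_gotext(
    SketchTreeHandler->new,
    '$Gene_ontology ; GO:0000001',
    ' %function ; GO:0000002',
    '  %enzyme ; GO:0000003',
);
my $tree = $handler->tree;    # nested array refs, rooted at 'go'
```

Any object with the same three methods could be passed in place of the
tree handler, which is the point of keeping the parser ignorant of what
happens to its events.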

A similar event tree for an EMBL record may look something like this:

[embl-set
  [entry
    [locus Blah]
    [accession AFnnnnnn]
    [ftable
      [feature
        [type mRNA]
        [location
          [from 100]
          [to 200]
        ]
        [location
          [from 300]
          [to 400]
        ]
        [strain blah]
        [product shuggy]
...

I feel strongly that the default event mechanism should be as lightweight
as possible, using nested perl array references. Of course, one could
easily swap in a lightweight xml element generator, and then use xslt on
top of that. The above trees would just be attribute-less xml. We could
use dtds/xmlschema to describe the various formats at the event level, but
they would have to be attribute-less (this is consistent with ncbi xml
but not the biojava way of doing things).
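
To illustrate how directly the nested array references map onto
attribute-less xml, here is a rough sketch of a serializer. The function
names tree_to_xml and escape_text are invented for this example. It assumes
every node is [name, children...], where a leaf holds exactly one plain
scalar and an inner node holds only other nodes.

```perl
use strict;
use warnings;

# Escape the characters that are special in xml text content.
sub escape_text {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Serialize a [name, children...] tree as indented, attribute-less xml.
sub tree_to_xml {
    my ($node, $indent) = @_;
    $indent = '' unless defined $indent;
    my ($name, @kids) = @$node;

    # leaf: a single scalar value becomes the element's text
    if (@kids == 1 && !ref $kids[0]) {
        return "$indent<$name>" . escape_text($kids[0]) . "</$name>\n";
    }

    # inner node: recurse into each child, two spaces deeper
    my $xml = "$indent<$name>\n";
    $xml .= tree_to_xml($_, "$indent  ") for @kids;
    return $xml . "$indent</$name>\n";
}

my $tree = ['go', ['term', ['name', 'function'], ['acc', 'GO:0000002']]];
print tree_to_xml($tree);
```

Nothing here needs an xml library, so xml output stays an optional layer on
top of the array-ref trees.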

I should be upfront about my bias here - xml and all its associated
bloated technology is just lisp recapitulated in a shockingly bad way. I'm
strongly against introducing dependencies on any unnecessary xml
constructs for this framework. Of course, like it or not, xml and xslt are
important and well-supported technologies, so the framework should play
well with them; it just shouldn't be dependent on them.

Assuming I haven't alienated everyone already, what will the namespaces
look like?

Of course, the interface to SeqIO et al should remain exactly the same. If
you want objects from a file, you don't care one way or another whether
the underlying framework is event driven or not.

So, how about something like this:

Bio/
        Handlers/
                HandlerI
                BaseHandler
                XmlOutHandler
        SeqIO/
                EMBL
                GenBank
                Swiss
                Parsers/
                        EMBLParser
                        GenBankParser
                        SwissParser
                Handlers/
                        EMBL2GenericSeqRecord
                        GenBank2GenericSeqRecord
                        GenericSeqRecord2Obj
                        GenericSeqRecord2Fasta
        OntologyIO/
                GoText
                Parsers/
                        GoTextParser
                        GoRdfParser
                Handlers/
                        Go2Obj
        SearchIO/
                Parsers/
                        Blast
                        Blastxml

The format-specific handler classes inherit from BaseHandler, and turn the
events into objects. Essentially they do the same thing as xslt, but in
perl. perl can do a reasonable job of impersonating lisp, so this is ok.
Or you can just get the parser to generate xml elements and use an actual
xslt transformer. Or mix and match the bio* code.
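
As a rough illustration of a format-specific handler, the sketch below
intercepts term events and collects one record per term. SketchBaseHandler
and SketchGo2Obj are hypothetical stand-ins for BaseHandler and Go2Obj, and
the "objects" are plain hashes rather than real ontology objects, since the
object model is still undecided.

```perl
use strict;
use warnings;

{
    package SketchBaseHandler;   # hypothetical stand-in for BaseHandler
    sub new         { my ($class) = @_; return bless {}, $class }
    sub start_event { }          # default: ignore every event
    sub event       { }
    sub end_event   { }
}

{
    package SketchGo2Obj;        # hypothetical stand-in for Go2Obj
    our @ISA = ('SketchBaseHandler');

    # a new term starts a fresh record
    sub start_event {
        my ($self, $name) = @_;
        $self->{current} = { parents => [] } if $name eq 'term';
    }

    # copy the leaf events we care about into the current record
    sub event {
        my ($self, $name, $value) = @_;
        my $term = $self->{current} or return;
        if    ($name eq 'name' || $name eq 'acc') { $term->{$name} = $value }
        elsif ($name eq 'obj') { push @{ $term->{parents} }, $value }
    }

    # when the term closes, move the record to the result list
    sub end_event {
        my ($self, $name) = @_;
        return unless $name eq 'term';
        push @{ $self->{terms} }, delete $self->{current};
    }

    sub terms { return @{ $_[0]{terms} || [] } }
}

my $h = SketchGo2Obj->new;
$h->start_event('go');
$h->start_event('term');
$h->event(name => 'enzyme');
$h->event(acc  => 'GO:0000003');
$h->start_event('relationship');
$h->event(type => 'isa');
$h->event(obj  => 'GO:0000002');
$h->end_event('relationship');
$h->end_event('term');
$h->end_event('go');

my ($term) = $h->terms;   # hash with name, acc and a list of parent accs
```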

bioperl-db could have its own handlers for turning events straight into
insert statements.

Ok, maybe the best thing for me to do is commit the GO side of things and
let you all decide for yourselves what the various merits are?

--
Chris