[Bioperl-l] GO parsers & event driven parsing framework
Matthew Pocock
matthew_pocock@yahoo.co.uk
Wed, 15 May 2002 22:44:21 +0100
Hi Chris,
Good luck with this. Anything that makes parsing streams of stuff easier
is good IMHO.
The biojava tag-value parser does a very similar job to this. Briefly,
it has start/endRecord, start/endTag and value methods. The whole
document is wrapped in a start/endRecord pair. Within a record, any
number of start/endTag pairs can be fired. Within a tag, any number of
values can be fired (e.g. one per list element). Also, if there is
sub-structure, a start/endRecord pair may be fired within a tag scope.
This nested record can then contain its own tag events, with value or
record events, and so on.
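To make the nesting rules concrete, here is a minimal sketch in Python
(not biojava code). The method names are snake_case stand-ins for the
start/endRecord, start/endTag and value methods mentioned above; the
RecordingListener class, the fire() helper and the sample tags are
hypothetical and exist only for illustration.

```python
class RecordingListener:
    """Hypothetical listener that just records each event in order."""

    def __init__(self):
        self.events = []

    def start_record(self):
        self.events.append(("start_record",))

    def end_record(self):
        self.events.append(("end_record",))

    def start_tag(self, tag):
        self.events.append(("start_tag", tag))

    def end_tag(self):
        self.events.append(("end_tag",))

    def value(self, v):
        self.events.append(("value", v))


def fire(record, listener):
    """Walk a dict of tag -> list of values and fire the matching events.

    A dict found in a value list is treated as sub-structure, so it is
    fired as a nested record inside the enclosing tag scope.
    """
    listener.start_record()
    for tag, values in record.items():
        listener.start_tag(tag)
        for v in values:
            if isinstance(v, dict):
                fire(v, listener)
            else:
                listener.value(v)
        listener.end_tag()
    listener.end_record()


# Made-up document: one plain tag, one tag holding a nested record.
doc = {"ID": ["X1"], "FT": [{"type": ["mRNA"]}]}
listener = RecordingListener()
fire(doc, listener)
for event in listener.events:
    print(event)
```

The point is only the shape of the stream: records contain tags, tags
contain values or further records.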
There is a standard listener that consumes these events and builds a tree
of Annotation objects. The tag-value stream is seen as just that - a
stream of observations. The static representation is totally decoupled (a
tree of Annotation objects being one possible representation).
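As a rough illustration of a tree-building listener, the Python sketch
below collects the event stream into nested dicts. It is an assumption
about the general technique, not the biojava implementation: plain
dicts stand in for Annotation objects, and the TreeBuilder name and
sample tags are hypothetical.

```python
class TreeBuilder:
    """Hypothetical listener that turns the event stream into nested dicts.

    Each record becomes a dict mapping tag -> list of values; a nested
    record becomes a dict stored in the value list of its enclosing tag.
    """

    def __init__(self):
        self.root = None      # finished top-level record
        self._records = []    # stack of records still being built
        self._tags = []       # stack of currently open tag names

    def start_record(self):
        self._records.append({})

    def end_record(self):
        finished = self._records.pop()
        if self._records:
            # Nested record: attach it as a value of the enclosing tag.
            self._records[-1][self._tags[-1]].append(finished)
        else:
            self.root = finished

    def start_tag(self, tag):
        self._tags.append(tag)
        self._records[-1].setdefault(tag, [])

    def end_tag(self):
        self._tags.pop()

    def value(self, v):
        self._records[-1][self._tags[-1]].append(v)


# Drive the builder by hand with a small made-up event stream.
builder = TreeBuilder()
builder.start_record()
builder.start_tag("ID")
builder.value("X1")
builder.end_tag()
builder.start_tag("FT")
builder.start_record()
builder.start_tag("type")
builder.value("mRNA")
builder.end_tag()
builder.end_record()
builder.end_tag()
builder.end_record()
print(builder.root)
```

Because the builder only sees events, any other listener could be
swapped in to produce a different static representation from the same
stream.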
There are standard file parsers which emit these events for embl- and
genbank-like formats.
We have a few useful listeners that forward events down a chain but
mutate or intercept the events (e.g. split a single value into multiple
values by splitting on comma, drop all sub-trees under a given tag, or
replace values with newly built objects).
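A chain of forwarding listeners might look roughly like the Python
sketch below. All class names here (Log, Forwarder, SplitOnComma,
DropTag) and the sample tags are hypothetical; this shows the chaining
idea under those assumptions, not the actual biojava classes.

```python
class Log:
    """Hypothetical sink: records any event method called on it."""

    def __init__(self):
        self.events = []

    def __getattr__(self, name):
        # Only reached for names not found normally, i.e. event methods.
        return lambda *args: self.events.append((name, *args))


class Forwarder:
    """Passes every event unchanged to the next listener in the chain."""

    def __init__(self, delegate):
        self.delegate = delegate

    def start_record(self):
        self.delegate.start_record()

    def end_record(self):
        self.delegate.end_record()

    def start_tag(self, tag):
        self.delegate.start_tag(tag)

    def end_tag(self):
        self.delegate.end_tag()

    def value(self, v):
        self.delegate.value(v)


class SplitOnComma(Forwarder):
    """Turns one comma-separated value into several value events."""

    def value(self, v):
        for part in str(v).split(","):
            self.delegate.value(part.strip())


class DropTag(Forwarder):
    """Swallows a given tag and everything nested beneath it."""

    def __init__(self, delegate, tag):
        super().__init__(delegate)
        self.tag = tag
        self._depth = 0  # > 0 while inside the dropped sub-tree

    def start_tag(self, tag):
        if self._depth or tag == self.tag:
            self._depth += 1
        else:
            self.delegate.start_tag(tag)

    def end_tag(self):
        if self._depth:
            self._depth -= 1
        else:
            self.delegate.end_tag()

    def start_record(self):
        if not self._depth:
            self.delegate.start_record()

    def end_record(self):
        if not self._depth:
            self.delegate.end_record()

    def value(self, v):
        if not self._depth:
            self.delegate.value(v)


log = Log()
chain = DropTag(SplitOnComma(log), "XX")
chain.start_record()
chain.start_tag("KW")
chain.value("kinase, membrane")
chain.end_tag()
chain.start_tag("XX")
chain.value("ignored, too")
chain.end_tag()
chain.end_record()
print(log.events)
```

Each filter only knows about the next listener, so filters can be
stacked in any order in front of whatever listener builds the final
objects.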
I think it should translate quite cleanly to perl, if this is the sort
of thing you want. Regular expression transducing listeners make
short work of most of the mess we encounter daily.
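One possible reading of a "regular expression transducing listener" is
sketched below in Python: values under a chosen tag are matched against
a pattern, and each named group is re-fired as its own tag inside a
nested record. This is an interpretation for illustration only; the
RegexTransducer and Log names, the "location" tag and the "100..200"
value syntax are all assumptions.

```python
import re


class Log:
    """Hypothetical sink: records any event method called on it."""

    def __init__(self):
        self.events = []

    def __getattr__(self, name):
        # Only reached for names not found normally, i.e. event methods.
        return lambda *args: self.events.append((name, *args))


class RegexTransducer:
    """Rewrites matching values under one tag into a nested record."""

    def __init__(self, delegate, tag, pattern):
        self.delegate = delegate
        self.tag = tag
        self.pattern = re.compile(pattern)
        # Named groups, ordered by their position in the pattern.
        self._groups = sorted(self.pattern.groupindex,
                              key=self.pattern.groupindex.get)
        self._open = []  # stack of currently open tag names

    def start_record(self):
        self.delegate.start_record()

    def end_record(self):
        self.delegate.end_record()

    def start_tag(self, tag):
        self._open.append(tag)
        self.delegate.start_tag(tag)

    def end_tag(self):
        self._open.pop()
        self.delegate.end_tag()

    def value(self, v):
        in_scope = bool(self._open) and self._open[-1] == self.tag
        match = self.pattern.fullmatch(v) if in_scope else None
        if match is None:
            # Not our tag, or the value does not match: pass it through.
            self.delegate.value(v)
            return
        # Re-fire the captured pieces as a nested record of tags.
        self.delegate.start_record()
        for name in self._groups:
            self.delegate.start_tag(name)
            self.delegate.value(match.group(name))
            self.delegate.end_tag()
        self.delegate.end_record()


log = Log()
transducer = RegexTransducer(log, "location",
                             r"(?P<from>\d+)\.\.(?P<to>\d+)")
transducer.start_record()
transducer.start_tag("location")
transducer.value("100..200")
transducer.end_tag()
transducer.end_record()
print(log.events)
```

Values that do not match are forwarded untouched, so the transducer can
sit in a chain without disturbing tags it knows nothing about.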
Matthew
Chris Mungall wrote:
> I'm rewriting the GO-text parsers (currently part of the GO software
> toolkit on sourceforge) and will probably commit these to bioperl.
>
> I'm using a lightweight event driven framework, and obviously it would
> make sense to use the same framework for any rewrite of the current SeqIO
> parsers. I'll outline my method, and will happily change it to fit into
> another framework if anyone has any suggestions.
>
> Initially I'd like to check in just the event-driven parsing part, and
> think about graph/ontology object models later.
>
> The GO-text parser (and other parsers following the framework) generates
> nested events. I'm using the term "event" to be SAX friendly, but really
> these are just trees.
>
> Let's take an imaginary GO graph; GO-style structured controlled
> vocabularies (see ftp://ftp.geneontology.org/pub/go/ontology) are often
> stored in the gotext flatfile format. This is a somewhat ad-hoc format
> that uses indentation to represent graphs as trees; multiple parentage is
> either represented as duplicate subtrees or in the detail line.
>
> $Gene_ontology ; GO:0000001
> %function ; GO:0000002
> %enzyme ; GO:0000003
>
> Other more robust formats are possible (n3, rdf, oil, daml, etc) but
> gotext is already prevalent and it is important to be able to parse it.
> You get used to it after a while - the above means "enzyme" is a subtype
> of "function" is a subtype of the "Gene_ontology" general type.
>
> The GoText parser would zip through this, firing nested events in the
> following structure:
>
> [go
> [term
> [name 'function']
> [acc 'GO:0000002']
> ]
> [term
> [name 'enzyme']
> [acc 'GO:0000003']
> [relationship
> [type isa]
> [obj 'GO:0000002']
> ]
> ]
> ]
>
> e.g. the parsing code starts off by calling
>
> $self->start_event("go")
>
> when a new term is encountered, it fires this
>
> $self->event("name", $name)
>
> at the end it says
>
> $self->end_event("go")
>
> the event handler can be overridden to intercept any of these; the default
> handler will just catch and nest all the events and return a tree as
> above.
>
> A similar event tree for an EMBL record may look something like this:
>
> [embl-set
> [entry
> [locus Blah]
> [accession AFnnnnnn]
> [ftable
> [feature
> [type mRNA]
> [location
> [from 100]
> [to 200]
> ]
> [location
> [from 300]
> [to 400]
> ]
> [strain blah]
> [product shuggy]
> ...
>
> I feel strongly that the default event mechanism should be as lightweight
> as possible, using nested perl array references. Of course, one could
> easily swap in a lightweight xml element generator, and then use xslt on
> top of that. The above trees would just be attribute-less xml. We could
> use dtds/xmlschema to describe the various formats at the event level, but
> they would have to be attribute-less (this is consistent with ncbi xml
> but not the biojava way of doing things).
>
> I should be upfront about my bias here - xml and all its associated
> bloated technology is just lisp recapitulated in a shockingly bad way. I'm
> strongly against introducing dependencies on any unnecessary xml
> constructs for this framework. Of course, like it or not, xml and xslt are
> important and well-supported technologies, so the framework should play
> well with them, it just shouldn't be dependent on them.
>
> Assuming I haven't alienated everyone already, what will the namespaces
> look like?
>
> Of course, the interface to SeqIO et al should remain exactly the same. If
> you want objects from a file, you don't care one way or another whether
> the underlying framework is event driven or not.
>
> So, how about something like this:
>
> Bio/
> Handlers/
> HandlerI
> BaseHandler
> XmlOutHandler
> SeqIO/
> EMBL
> GenBank
> Swiss
> Parsers/
> EMBLParser
> GenBankParser
> SwissParser
> Handlers/
> EMBL2GenericSeqRecord
> GenBank2GenericSeqRecord
> GenericSeqRecord2Obj
> GenericSeqRecord2Fasta
> OntologyIO/
> GoText
> Parsers/
> GoTextParser
> GoRdfParser
> Handlers/
> Go2Obj
> SearchIO/
> Parsers/
> Blast
> Blastxml
>
> The format-specific handler classes inherit from BaseHandler, and turn the
> events into objects. Essentially they do the same thing as xslt, but in
> perl. perl can do a reasonable job of impersonating lisp, so this is ok.
> Or you can just get the parser to generate xml elements and use an actual
> xslt transformer. Or mix and match the bio* code.
>
> bioperl-db could have its own handlers for turning events straight into
> insert statements.
>
> Ok, maybe the best thing for me to do is commit the GO side of things and
> let you all decide for yourself what the various merits etc are?
>
> --
> Chris