[Bioperl-l] genpept/swiss

Tue, 5 Sep 2000 05:42:55 -0600

Kris Boulez <krbou@pgsgent.be>
>The problem now is that BioPerl throws a warning if a sequence does not
>comply 100% with the standards and exits. While at that moment I want to
>be able to say that he can ignore the warning if (e.g.) he has read the
>sequence correctly.

Hilmar Lapp:
>Does this sound like a call for a callback a client program can
>provide? The question then is what should be passed to the callback
>routine? The sequence object as it has been constructed so far? Sounds
>fragile, and may be useless in many cases. The complete offending
>source record? Would discard the parse done so far (for the callback),
>and would require a partial rewrite of the parsers because they read
>line-by-line (at least most if not all of the rich format parsers).

Yes, it does.

I forgot to mention the way SAX parsers for XML handle problems.
There are three callback objects associated with the parser: the
document handler, error handler and DTD handler.  The first
deals with the characters and XML elements in the document.  The second
handles errors, and the third isn't relevant for this discussion.

The error handler has three callback methods, called 'warning', 'error'
and 'fatalError'.  They all take a single parameter, which is an
object which can be queried for information and, if desired, thrown.
The parser stops when any exceptions are raised in the callback.
The 'error' callback is used when the error is recoverable (as with
this tag=value problem) while fatalError is used for unrecoverable
errors.  The parser must stop parsing after a fatalError.
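
For a concrete look at those three methods, here is a minimal sketch using
Python's standard xml.sax module, which mirrors the SAX error handler
(it follows SAX 2, so the document handler is called ContentHandler there).
The class name RecordingErrorHandler and the sample input are made up for
illustration.  As far as I know, the expat-based reader behind xml.sax
reports well-formedness problems through fatalError.

```python
import xml.sax
from xml.sax.handler import ContentHandler, ErrorHandler


class RecordingErrorHandler(ErrorHandler):
    """Hypothetical handler: remembers every report, stops on fatalError."""

    def __init__(self):
        self.seen = []

    def warning(self, exception):
        self.seen.append(("warning", exception))

    def error(self, exception):
        # Recoverable: returning without raising lets the parser continue.
        self.seen.append(("error", exception))

    def fatalError(self, exception):
        # Unrecoverable: record it, then raise so that parsing stops.
        self.seen.append(("fatalError", exception))
        raise exception


handler = RecordingErrorHandler()
try:
    # Mismatched tags, so the document is not well-formed.
    xml.sax.parseString(b"<a><b></a>", ContentHandler(), handler)
except xml.sax.SAXParseException as exc:
    # The single parameter is an object that can be queried for position.
    print("stopped at line", exc.getLineNumber(),
          "column", exc.getColumnNumber())
```

Note that the callback receives only the exception object, never the
partially built document.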

So I do like callbacks for processing.

Bioperl combines the parser and the doc and error handlers, which
lowers the complexity of the problem but also reduces the flexibility.
Let's consider the idea of supporting a SAX-like callback.

What should be passed in the callback?  The SAX interface takes
a single object, which describes the error.  (This assumes typed
objects being passed, so the callback can figure out which error
occurred.)  The object can optionally be queried to get position
information about where the error occurred.

The partially formed record is not passed to the error handler.  This is
because SAX isn't building the record, which is instead done inside
of the DocumentHandler.  In other words, it punts on Hilmar's question
about passing the record to the error handler since it isn't in charge
of maintaining the record.  Instead, the error handler and the document
handler have to be told beforehand to work with each other.

If the record isn't passed in, then the callback won't be able to do
some sorts of error reporting, like listing the ID of the record which
failed.  On the other hand, that can't be done now, which means it
hasn't been a real problem.  Plus, it's possible to define that a
special object (not necessarily the sequence object) be passed as part
of the error object.  Eg, this could pass in the ID.

So I would suggest adding an optional callback to bioperl (either a
single function or an object with three methods like SAX).  This
callback gets the warning and error information, with either 3 or 4
pieces of information:
 - severity (eg, warn, error, fatalError; either as a string or
    specific method)
 - type of error (eg, a common string description or a typed object)
 - state for the error (eg, some description about the error either as an
    anonymous hash or part of the type error object.  This could include
    the sequence ID, if available, though I feel the amount of data should
    be limited.  Also, I feel it must act like a read-only object.)
 - (optional) position in the file where the error occurred
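
The pieces of information listed above could be bundled into one small
read-only object.  This is only a sketch in Python, not a proposed bioperl
API: the name ParseProblem, its field names, the "bad_tag_value" string and
the ID "P12345" are all hypothetical.

```python
from dataclasses import dataclass, field
from types import MappingProxyType
from typing import Mapping, Optional


@dataclass(frozen=True)
class ParseProblem:
    """Hypothetical error object handed to the callback."""

    severity: str                   # "warning", "error" or "fatalError"
    kind: str                       # type of error, as a common string
    state: Mapping[str, str] = field(default_factory=dict)
    position: Optional[int] = None  # line number in the file, if known

    def __post_init__(self):
        # Wrap a private copy in a read-only view so the callback cannot
        # modify the state (the dataclass itself is frozen as well).
        object.__setattr__(self, "state", MappingProxyType(dict(self.state)))


# Example report: a recoverable error on line 17 of a made-up record.
problem = ParseProblem("error", "bad_tag_value", {"id": "P12345"}, position=17)
print(problem.severity, problem.kind, problem.state["id"], problem.position)
```

Keeping the state to a small mapping of strings is one way to limit how
much data travels with the error.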

This doesn't change the existing API or code since the error handler is
optional.  Though this does mean the default error and fatalError handlers
raise the error, to be consistent with current usage.  This is different
than SAX where the default handler silently ignores errors.  (I don't like
that behaviour anyway :)
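
To show how an optional handler with raising defaults might behave, here is
a toy sketch in Python.  Nothing in it is bioperl code: the tag=value line
format, parse_tag_values, RaisingHandler and CollectingHandler are all
assumptions, and printing warnings to stderr is my own guess at a sensible
default.

```python
import sys


class ParseProblem(Exception):
    """Describes one problem; it can be inspected or raised."""

    def __init__(self, message, line_number):
        super().__init__(f"line {line_number}: {message}")
        self.line_number = line_number


class RaisingHandler:
    """Default handler: error and fatalError raise, as the parser does now."""

    def warning(self, problem):
        print(f"warning: {problem}", file=sys.stderr)  # assumed default

    def error(self, problem):
        raise problem

    def fatalError(self, problem):
        raise problem


class CollectingHandler(RaisingHandler):
    """Client-supplied handler that tolerates recoverable errors."""

    def __init__(self):
        self.problems = []

    def error(self, problem):
        self.problems.append(problem)  # returning lets the parser go on


def parse_tag_values(lines, handler=None):
    """Toy parser for 'tag=value' lines; the handler argument is optional."""
    handler = handler or RaisingHandler()
    record = {}
    for number, line in enumerate(lines, start=1):
        tag, sep, value = line.partition("=")
        if not sep:
            handler.error(
                ParseProblem(f"expected tag=value, got {line!r}", number))
            continue  # the handler did not raise, so skip the bad line
        record[tag.strip()] = value.strip()
    return record


lines = ["ID=seq1", "this line is malformed", "LENGTH=42"]
lenient = CollectingHandler()
record = parse_tag_values(lines, lenient)
print(record, [str(p) for p in lenient.problems])
```

Calling parse_tag_values(lines) without a handler raises ParseProblem at the
second line, so callers that never supply a handler see the same strict
behaviour as before.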

You don't need to have your classes derived from the XML::SAX ones.  Indeed,
that likely adds a dependency you don't want.  You don't even need to use
the same names and callback parameters.  Instead, this suggestion is meant
to point out a standardized way of error handling which is pretty usable
and I believe applicable to this problem.  By using existing interfaces,
even just at the semantic level, you reduce the overhead of having to
learn new APIs all the time.

                    Andrew
                    dalke@acm.org