[Biojava-l] looking for datafile parsers

Simon Brocklehurst simon.brocklehurst@CambridgeAntibody.com
Thu, 11 Jan 2001 13:59:22 +0000

Hi Andrew,

You might be interested to know that CAT has contributed to biojava a
SAX2-compliant, event-based parsing framework for dealing with
bioinformatics data files.

Essentially, by using a SAX2 model, the framework allows users to build
arbitrary XML Content Handlers for dealing with data from
bioinformatics files in arbitrary ways.  The framework generates SAX2
events from bioinformatics format files, i.e. the input data isn't XML,
nor is it converted into XML internally.
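
To make the idea concrete, here is a minimal sketch using only the
standard org.xml.sax classes.  It is not the biojava API: ToyHitReader,
HighScoreHandler, the element names and the tab-separated "id, score"
input format are all hypothetical, invented for illustration.  The
reader fires SAX2 events directly from non-XML text, and the handler
does something application-specific with them.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

public class ToyHitSaxDemo {

    /** Hypothetical reader: turns "id<TAB>score" lines into SAX2 events. */
    static class ToyHitReader extends XMLFilterImpl {
        @Override
        public void parse(InputSource input) throws IOException, SAXException {
            // Sketch only: assumes the InputSource wraps a character stream.
            BufferedReader in = new BufferedReader(input.getCharacterStream());
            startDocument();
            startElement("", "HitList", "HitList", new AttributesImpl());
            String line;
            while ((line = in.readLine()) != null) {   // one line at a time
                String[] fields = line.split("\t");
                if (fields.length < 2) continue;        // skip blank or malformed lines
                AttributesImpl atts = new AttributesImpl();
                atts.addAttribute("", "id", "id", "CDATA", fields[0]);
                atts.addAttribute("", "score", "score", "CDATA", fields[1]);
                startElement("", "Hit", "Hit", atts);
                endElement("", "Hit", "Hit");
            }
            endElement("", "HitList", "HitList");
            endDocument();
        }
    }

    /** Application-specific handler: collects ids of hits at or above a cutoff. */
    static class HighScoreHandler extends DefaultHandler {
        final List<String> ids = new ArrayList<>();
        private final double cutoff;

        HighScoreHandler(double cutoff) { this.cutoff = cutoff; }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("Hit".equals(qName)
                    && Double.parseDouble(atts.getValue("score")) >= cutoff) {
                ids.add(atts.getValue("id"));
            }
        }
    }

    static List<String> highScoringIds(String data, double cutoff)
            throws IOException, SAXException {
        ToyHitReader reader = new ToyHitReader();
        HighScoreHandler handler = new HighScoreHandler(cutoff);
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(data)));
        return handler.ids;
    }

    public static void main(String[] args) throws Exception {
        // prints [P1, P3]
        System.out.println(highScoringIds("P1\t50.0\nP2\t12.5\nP3\t99.0\n", 40.0));
    }
}
```

Extending XMLFilterImpl is just a shortcut here to avoid writing out
every XMLReader method; a real parser would handle byte streams and
feature requests properly.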

It's a reasonable implementation of SAX2, e.g. users can:

o Set properties on SAX Parsers, e.g. configure various features such as
namespace reporting etc.

o Handle arbitrarily large files, because it works like a SAXParser
should, i.e. it doesn't keep the whole file in memory etc.

o Deal with InputSources, i.e. essentially various flavours of streams.
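
For readers less familiar with SAX2, the points above correspond to
standard XMLReader calls.  The sketch below uses the JDK's own XML
parser as a stand-in, since the exact biojava class names aren't given
here; SaxConfigDemo and ElementCounter are hypothetical names.  It sets
the two standard namespace features and parses from both a character
stream and a byte stream, counting elements as events arrive rather
than building the document in memory.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class SaxConfigDemo {

    /** Counts start-element events; keeps no other state. */
    static class ElementCounter extends DefaultHandler {
        int count;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            count++;
        }
    }

    static int countElements(InputSource source) throws Exception {
        // Stand-in reader: the JDK's XML parser, not a biojava parser.
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();

        // Standard SAX2 feature names controlling namespace reporting.
        reader.setFeature("http://xml.org/sax/features/namespaces", true);
        reader.setFeature("http://xml.org/sax/features/namespace-prefixes", false);

        ElementCounter counter = new ElementCounter();
        reader.setContentHandler(counter);
        reader.parse(source);   // events are delivered as the input is read
        return counter.count;
    }

    public static void main(String[] args) throws Exception {
        String doc = "<a><b/><b/></a>";

        // An InputSource can wrap a character stream...
        System.out.println(countElements(new InputSource(new StringReader(doc))));

        // ...or a byte stream (a system identifier such as a URL also works).
        byte[] bytes = doc.getBytes(StandardCharsets.UTF_8);
        System.out.println(countElements(new InputSource(new ByteArrayInputStream(bytes))));
    }
}
```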

A couple of neat benefits of implementing SAX2:

o It's trivial to create XML format versions of files, with which you
can then do whatever you want, e.g. using XSLT.

o By stringing together biojava SAXParsers, which are non-validating,
with validating SAXParsers from e.g. IBM, you can create parsers that
validate against DTDs and/or XML Schemas that we produce for the data
formats supported by the framework.  Because the bioinformatics data
is modelled in a strongly typed way by the framework, you can get
genuinely useful benefits from validation.
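
As a rough sketch of the first benefit: any XMLReader can be handed to
the standard javax.xml.transform API as a SAXSource, and an identity
Transformer will then serialise its events as XML.  Again this is not
biojava code; ToyHitReader, the element names and the tab-separated
input are hypothetical stand-ins for a real bioinformatics parser.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

public class ToyHitToXml {

    /** Hypothetical reader: turns "id<TAB>score" lines into SAX2 events. */
    static class ToyHitReader extends XMLFilterImpl {
        // Sketch only: accept and ignore configuration requests from the transformer.
        @Override public void setFeature(String name, boolean value) { }
        @Override public void setProperty(String name, Object value) { }

        @Override
        public void parse(InputSource input) throws IOException, SAXException {
            // Assumes the InputSource wraps a character stream.
            BufferedReader in = new BufferedReader(input.getCharacterStream());
            startDocument();
            startElement("", "HitList", "HitList", new AttributesImpl());
            String line;
            while ((line = in.readLine()) != null) {
                String[] fields = line.split("\t");
                if (fields.length < 2) continue;   // skip blank or malformed lines
                AttributesImpl atts = new AttributesImpl();
                atts.addAttribute("", "id", "id", "CDATA", fields[0]);
                atts.addAttribute("", "score", "score", "CDATA", fields[1]);
                startElement("", "Hit", "Hit", atts);
                endElement("", "Hit", "Hit");
            }
            endElement("", "HitList", "HitList");
            endDocument();
        }
    }

    /** Serialises the reader's SAX events as an XML string. */
    static String toXml(String data) throws Exception {
        // newTransformer() with no stylesheet gives an identity transform.
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        SAXSource source =
                new SAXSource(new ToyHitReader(), new InputSource(new StringReader(data)));
        StringWriter out = new StringWriter();
        identity.transform(source, new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(toXml("P1\t50.0\nP2\t12.5\n"));
    }
}
```

Passing a stylesheet Source to newTransformer() instead would apply
XSLT to the same event stream.  For the second benefit, later JDKs also
offer javax.xml.validation.ValidatorHandler, a ContentHandler that
validates SAX events against an XML Schema, which is one possible way
to do the chaining described above.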

We haven't put SwissProt support into this framework yet -
biojava already had ways of handling SwissProt data before we put the
SAX2 framework in.  Currently it has OK support for NCBI Blast
and WU-Blast, and improving support for HMMER and PDB data.

Hope this info is useful...

Simon M. Brocklehurst, Ph.D.
Head of Bioinformatics & Advanced IS
Cambridge Antibody Technology
The Science Park, Melbourn, Cambridgeshire, UK