[Biojava-l] Possible Submission

Simon Brocklehurst simon.brocklehurst@CambridgeAntibody.com
Mon, 15 Oct 2001 19:01:47 +0100

Robert Hubley wrote:

> I have developed a parsing framework called LSAX that I would
> like to submit to BioJava.  It was inspired by the work of Cambridge
> Antibody Technology (Simon Brocklehurst et al.) on the BioJava
> BlastLikeSaxParser.  The idea is the same -- create a bridge
> between XML applications and Non-XML data.  The difference
> between the CAT parser and LSAX is in the design of the raw
> file parser.  I use LEX (actually JFLEX) to tokenize the raw
> data files and generate Start, Data, and End SAX events.  I have
> developed two parsers using this framework: an NCBI Blast parser and
> a Fasta parser.  The advantage to using LEX is that you can specify
> the rules of your parser at a high level with regular expressions.  The
> actual parser is then auto-generated using JFLEX and is often
> faster than a parser you would write by hand.
> Let me know if you would like to include this in BioJava,


Sounds good to me - this sounds like a similar idea to Andrew Dalke's
Martel package in biopython.  This would be a valuable addition to biojava, I
think.  I don't know if your parsing framework does this already, but it
would be really cool if it were SAX2 compliant (as opposed to SAX1).
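
To make the SAX2 point concrete, here is a minimal sketch of what emitting
SAX2 events from a flat file could look like. It only assumes the standard
org.xml.sax.ContentHandler interface; the class name FastaSax2Sketch, the
element names "fasta" and "sequence", and the header regular expression are
made up for illustration and are not taken from LSAX or from the BioJava
parsers.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: turns FASTA-like text into SAX2 events.
class FastaSax2Sketch {
    // Assumed rule: a header line is '>' followed by an identifier.
    private static final Pattern HEADER = Pattern.compile("^>(\\S+).*$");

    static void parse(BufferedReader in, ContentHandler out)
            throws IOException, SAXException {
        out.startDocument();
        // SAX2 signature: namespace URI, local name, qualified name, attributes.
        out.startElement("", "fasta", "fasta", new AttributesImpl());
        boolean open = false;
        String line;
        while ((line = in.readLine()) != null) {
            Matcher m = HEADER.matcher(line);
            if (m.matches()) {
                // A new header closes the previous record, if any.
                if (open) out.endElement("", "sequence", "sequence");
                AttributesImpl atts = new AttributesImpl();
                atts.addAttribute("", "id", "id", "CDATA", m.group(1));
                out.startElement("", "sequence", "sequence", atts);
                open = true;
            } else if (open && !line.trim().isEmpty()) {
                // Residue lines become character data of the open record.
                char[] chars = line.trim().toCharArray();
                out.characters(chars, 0, chars.length);
            }
        }
        if (open) out.endElement("", "sequence", "sequence");
        out.endElement("", "fasta", "fasta");
        out.endDocument();
    }

    // Demo handler that records the event stream as a compact string.
    static String trace(String fasta) {
        final StringBuilder sb = new StringBuilder();
        ContentHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                sb.append('<').append(qName);
                for (int i = 0; i < atts.getLength(); i++) {
                    sb.append(' ').append(atts.getQName(i))
                      .append('=').append(atts.getValue(i));
                }
                sb.append('>');
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                sb.append("</").append(qName).append('>');
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                sb.append(ch, start, length);
            }
        };
        try {
            parse(new BufferedReader(new StringReader(fasta)), handler);
        } catch (IOException | SAXException e) {
            throw new IllegalStateException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints <fasta><sequence id=seq1>ACGTTTGA</sequence></fasta>
        System.out.println(trace(">seq1 demo\nACGT\nTTGA\n"));
    }
}
```

The visible difference from SAX1 is the ContentHandler interface itself, with
its namespace-aware startElement and endElement signatures, in place of the
older DocumentHandler.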

I also have a question. Can it cope with specifying formats that span multiple
lines?  Or is it limited to treating non-XML files as being essentially
record-based, i.e. dealing with single lines at a time? Sometimes, it's
useful (and necessary) to be able to read ahead several lines of a file
before actually parsing, i.e. emitting SAX events.
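
As a rough illustration of the kind of read-ahead I mean, the sketch below
buffers lines until a blank line ends the record, and only then emits events
for the whole block. Everything in it is an assumption for the sake of the
example: the class name ReadAheadSketch, the blank-line record terminator,
and the plain strings standing in for real SAX start, data, and end events.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: buffer several lines before emitting any events.
class ReadAheadSketch {
    static List<String> events(String text) {
        List<String> out = new ArrayList<>();
        List<String> pending = new ArrayList<>();
        for (String line : text.split("\n", -1)) {
            if (line.trim().isEmpty()) {
                // Blank line: the buffered record is complete, so emit it.
                flush(pending, out);
            } else {
                // Otherwise keep reading ahead without emitting anything.
                pending.add(line.trim());
            }
        }
        flush(pending, out); // emit a final record with no trailing blank line
        return out;
    }

    private static void flush(List<String> pending, List<String> out) {
        if (pending.isEmpty()) return;
        out.add("start:record");
        out.add("data:" + String.join(" ", pending));
        out.add("end:record");
        pending.clear();
    }

    public static void main(String[] args) {
        for (String e : events("Query= abc\n  def\n\nQuery= xyz\n")) {
            System.out.println(e);
        }
    }
}
```

Here the wrapped lines "Query= abc" and "def" end up in a single data event,
which a strictly line-at-a-time parser could not produce without some state
of its own.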

Simon M. Brocklehurst, Ph.D.
Head of Bioinformatics & Advanced IS
Cambridge Antibody Technology
The Science Park, Melbourn, Cambridgeshire, UK