[Biojava-l] blast xml parser

Simon Brocklehurst simon.brocklehurst@CambridgeAntibody.com
Fri, 08 Jun 2001 09:34:22 +0100


Hi,

xling wrote:

> Hi,
>
> I just came back from the JavaOne conference in San Francisco.  One of the
> talks was about XML-to-Java binding.
>
> Sun has just released JAXB (http://java.sun.com/xml/jaxb/index.html), which
> is intended to do XML-to-Java binding.
>
> This makes me think of the biojava blast parser.  I have to admit that there
> was a significant learning curve for me to get comfortable with the
> biojava SAX parser. Even after stepping through the parser implementation
> code and knowing exactly how it works, I still found it of little help
> in actual blast parsing (extracting alignments, start and end positions,
> etc.) unless further work is devoted to it.

There shouldn't be any learning curve at all - if you understand SAX, that is.

> Compared to bioperl, the demo code is just a "proof of concept" rather than
> an implementation library that can be of real use.  Correct me if I am wrong.

You are wrong.   The biojava Blast-like parsing package is scalable in ways
that the bioperl library is not.   I think you have missed the whole point of
SAX, which is to provide an event-based parsing framework.  This separates the
parsing step from the instantiation of objects.   The benefit of this is
enormous if you want to do serious things with the output of programs - it
means you can use objects suited to a particular purpose, and spend only a few
minutes writing the code to instantiate them (using SAX), rather than days or
even weeks (if parsing is linked to object instantiation).
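
To make the separation concrete, here is a minimal sketch of a SAX handler that
builds purpose-specific objects from parse events.  It uses only the standard
JAXP/SAX classes.  The element name "Hit", its attributes "id", "start" and
"end", and the HitCollector and Hit classes are all hypothetical, invented for
illustration; they are not the biojava element names or classes.  A stock XML
parser stands in for the event source here; the handler only sees SAX events,
so it would not care where they come from.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class HitCollector extends DefaultHandler {

    /** Hypothetical purpose-built object; not a biojava class. */
    public static final class Hit {
        public final String id;
        public final int start;
        public final int end;

        Hit(String id, int start, int end) {
            this.id = id;
            this.start = start;
            this.end = end;
        }
    }

    private final List<Hit> hits = new ArrayList<>();

    // Object instantiation lives here, in the handler, not in the parser.
    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        if ("Hit".equals(qName)) {   // "Hit" is an assumed element name
            hits.add(new Hit(atts.getValue("id"),
                             Integer.parseInt(atts.getValue("start")),
                             Integer.parseInt(atts.getValue("end"))));
        }
    }

    public List<Hit> getHits() {
        return hits;
    }

    /** Feeds SAX events from an XML string into a fresh handler. */
    public static List<Hit> parse(String xml) {
        try {
            HitCollector handler = new HitCollector();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.getHits();
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<Results>"
                + "<Hit id=\"P12345\" start=\"10\" end=\"250\"/>"
                + "<Hit id=\"Q67890\" start=\"3\" end=\"97\"/>"
                + "</Results>";
        for (Hit h : parse(xml)) {
            System.out.println(h.id + " " + h.start + ".." + h.end);
        }
    }
}
```

Swapping in a different handler gives different objects from the same events,
without touching the parsing code.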

You may also want to do XSLT manipulations of the data - the parser plugs right
into this (thanks Mat and Harold!).
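
As a rough sketch of how a SAX event source can feed an XSLT transformation,
the standard javax.xml.transform API accepts a SAXSource wrapping any
XMLReader.  Below, a stock JAXP XMLReader stands in for the event source, and
the "Hit" element, its "id" attribute, the stylesheet and the HitsToText class
are hypothetical, made up for this example rather than taken from biojava.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

public class HitsToText {

    // Hypothetical stylesheet: emits each Hit id followed by a semicolon.
    static final String XSLT =
        "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"text\"/>"
        + "<xsl:template match=\"/\">"
        + "<xsl:for-each select=\"//Hit\">"
        + "<xsl:value-of select=\"@id\"/><xsl:text>;</xsl:text>"
        + "</xsl:for-each>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    public static String transform(String xml) {
        try {
            // Any XMLReader can be used here; XSLT needs namespace-aware events.
            SAXParserFactory spf = SAXParserFactory.newInstance();
            spf.setNamespaceAware(true);
            XMLReader reader = spf.newSAXParser().getXMLReader();

            SAXSource source =
                new SAXSource(reader, new InputSource(new StringReader(xml)));
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));

            StringWriter out = new StringWriter();
            transformer.transform(source, new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("transform failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<Results><Hit id=\"P12345\"/><Hit id=\"Q67890\"/></Results>";
        System.out.println(transform(xml));
    }
}
```

The transformer only ever talks to the XMLReader interface, which is why a
custom SAX-emitting parser can be dropped in where the stock reader is used
above.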

So the point is, yes, of course you *need* objects.  But not everyone needs the
*same* objects.  There is a set of objects in biojava designed for the
purpose of holding results from searches, so if you don't have special needs of
your own, then you could probably use these (Keith has used them to hold the
results from FASTA searches).

> So far biojava has not provided XML binding utilities, since it parses raw
> blast output rather than XML and uses the SAX parser as a general parsing
> tool. The object binding step after parsing is missing.  I am not sure
> anyone on the mailing list has really put the biojava SAX blast parser into
> real practice. If you have done this, please share your experience with me.
>
> In the past, this may have made sense.  But now the NCBI blast utility can
> produce XML output. I think biojava really should embrace this and bind the
> XML to the current biojava objects.

There is nothing to stop you from doing this. NCBI Blast provides XML output.
What more is there to do?  What biojava provides is a common XML format that
lets you exploit results from: NCBI Blast, WU-Blast, Fasta (thanks Keith!), and
HMMER (full support coming in a couple of days).

This gives significant potential for code re-use: for example, the code for
HTML visualisation (again coming in a couple of days), but also any other event
handlers.

> Or use the JAXB package to instantiate
> intermediate objects.  For parsing purposes, the intermediate objects may be
> good enough. I am thinking about trying this when I have some spare time to
> take JAXB for a spin.
>
> Please comment on this. As far as I am concerned, parsing the output of
> bioinformatics tools, including blast and similar programs, is critical.

JAXB is cool, what can I say.   You can trivially create XML using the biojava
framework - we should have an XML Schema to replace the current DTD soon, and
thus you should be able to use JAXB with biojava.  I'm not sure how JAXB scales
to large datasets; I haven't looked into it.

Simon
--
Simon M. Brocklehurst, Ph.D.
Head of Bioinformatics & Advanced IS
Cambridge Antibody Technology
The Science Park, Melbourn, Cambridgeshire, UK
http://www.CambridgeAntibody.com/
mailto:simon.brocklehurst@CambridgeAntibody.com