[Biojava-l] blast xml parser

Wiepert, Mathieu Wiepert.Mathieu@mayo.edu
Fri, 8 Jun 2001 07:35:26 -0500


My 2 cents...

Thank you for pointing out jaxb, that looks like just what I need at the
moment :)

In regards to your other comments, I ditto Simon on the use of the SAX
framework.  Saved me tons of time.  When the Biojava SAX components were
first written, I believe there was no XML format for BLAST outputs from any
program.  When I was adding a little functionality, XML just came to NCBI as
I was doing it, and GCG didn't have it yet.  Now that these things exist,
you may not even need the Biojava SAX parser if you are comfortable with
XSLT.   The use I saw for parsing BLAST output was to get interesting bits
from a file to build a data-mining tool.  I saw my possibilities for dealing
with Blast output as, among other things:
- a content handler in java with Biojava SAX2 compliant parser and text
Blast file
- a content handler in java with SAX2 compliant parser and XML Blast file
- a stylesheet in java with XALAN XSLT processor
- standalone XSLT processor like Saxon against text Blast files with Biojava
SAX parser plugged in
- standalone XSLT processor like Saxon against XML BLAST files.  
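
As a rough illustration of the second option (a plain SAX2 parser reading an
XML Blast file), here is a minimal sketch.  It assumes the NCBI XML output uses
a `Hit_id` element for each hit; the class name `BlastHitIdHandler` and the
`parse` helper are made up for this example and are not part of Biojava.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the text of every <Hit_id> element.
class BlastHitIdHandler extends DefaultHandler {
    private final List<String> hitIds = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private boolean inHitId = false;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("Hit_id".equals(qName)) {
            inHitId = true;
            text.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // characters() may be called several times per element, so accumulate.
        if (inHitId) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("Hit_id".equals(qName)) {
            hitIds.add(text.toString().trim());
            inHitId = false;
        }
    }

    List<String> getHitIds() {
        return hitIds;
    }

    // Convenience wrapper (an assumption of this sketch): parse XML held in a string.
    static List<String> parse(String xml) {
        BlastHitIdHandler handler = new BlastHitIdHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse Blast XML", e);
        }
        return handler.getHitIds();
    }
}
```

The handler only sees one event at a time, so memory use stays flat no matter
how many hits the file contains.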

This list is not exhaustive, I am sure, and there are different reasons
people might want to use them.  One reason to go with plain SAX rather than
XSLT, as Simon has pointed out to me before, is if you have very large blast
files (and I do), using XSLT is not great.  It usually tries to instantiate
your whole document in memory.  A sax parser is then just the trick.  There
are ways around this, but I have not explored them.
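
For comparison, the stylesheet route might look something like the sketch
below, using the standard `javax.xml.transform` API.  The stylesheet, the class
name `BlastXsltSketch` and the `Hit` / `Hit_id` element names are assumptions
for illustration only.  Processors typically build the whole input document in
memory before applying the templates, which is the cost mentioned above.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical example: list one hit id per line from Blast XML via XSLT.
class BlastXsltSketch {
    // Illustrative stylesheet; element names assume the NCBI XML Blast format.
    static final String STYLESHEET =
          "<xsl:stylesheet version='1.0' "
        + "xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='text'/>"
        + "<xsl:template match='/'>"
        + "<xsl:for-each select='//Hit'>"
        + "<xsl:value-of select='Hit_id'/><xsl:text>&#10;</xsl:text>"
        + "</xsl:for-each>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    static String listHitIds(String blastXml) {
        try {
            // Compile the stylesheet, then run it over the input document.
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(blastXml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }
}
```

For small result files this is less code than a content handler; for very
large ones the SAX approach is the safer choice.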

I can certainly see possibilities to take blast output (in either form, text
or XML), and constitute Biojava objects with direct binding, using jaxb, if
that is what it can do.  All the Java solutions above could use that quite
nicely.  So, who wants to volunteer to look into this? :)


-mat

 -----Original Message-----
From: 	Simon Brocklehurst [mailto:simon.brocklehurst@CambridgeAntibody.com]
Sent:	Friday, June 08, 2001 3:34 AM
To:	xling@tularik.com
Cc:	biojava-l@biojava.org
Subject:	Re: [Biojava-l] blast xml parser

Hi,

xling wrote:

> Hi,
>
> I just came back from the San Francisco JavaOne conference.  One of the
> talks was about xml java binding.
>
> Sun has just released JAXB ( http://java.sun.com/xml/jaxb/index.html ),
> which is trying to do the xml java binding.
>
> This makes me think of the biojava blast parser.  I have to admit that
> there is a significant learning curve for me to get comfortable with the
> biojava SAX parser. Even after I have stepped through the parser
> implementation code and know exactly how the implementation works, I still
> found it of little help in doing the blast parsing (extracting alignment,
> start and end, etc.) unless further work is put in.

There shouldn't be any learning curve at all - if you understand SAX, that is.

> Compared to bioperl, the demo code is just a "proof of concept" rather than
> an implementation library that can be of real use.  Correct me if I am wrong.

You are wrong.   The biojava Blast-like parsing package is scalable in ways
that the bioperl library is not.   I think you have missed the whole point of
SAX, which is to provide an event-based parsing framework.  This separates out
the parsing step from the instantiation of objects.   The benefit of this is
enormous if you want to do serious things with the output of programs - it
means you can use objects suited for a particular purpose, and spend only a
few minutes writing the code to instantiate them (using SAX), rather than days
or even weeks (if parsing is linked to object instantiation).
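
To make the separation concrete, here is a minimal sketch of a content handler
that instantiates a purpose-specific object (just a hit id plus the query
start and end) from SAX events.  It is shown against the NCBI XML element
names (`Hit_id`, `Hsp`, `Hsp_query-from`, `Hsp_query-to`) with the stock JAXP
parser; the class names `HspRangeHandler` and `QueryRange` are invented for
this example, and the same pattern should carry over to any SAX2-compliant
event source with its own element names.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: turns parse events into application-specific objects.
class HspRangeHandler extends DefaultHandler {
    // Only the fields this (imaginary) application cares about.
    static final class QueryRange {
        final String hitId;
        final int from;
        final int to;

        QueryRange(String hitId, int from, int to) {
            this.hitId = hitId;
            this.from = from;
            this.to = to;
        }
    }

    private final List<QueryRange> ranges = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private String hitId;
    private int from;
    private int to;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        text.setLength(0); // start collecting text for the new element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String value = text.toString().trim();
        switch (qName) {
            case "Hit_id":         hitId = value; break;
            case "Hsp_query-from": from = Integer.parseInt(value); break;
            case "Hsp_query-to":   to = Integer.parseInt(value); break;
            case "Hsp":            ranges.add(new QueryRange(hitId, from, to)); break;
            default:               break; // ignore everything else
        }
        text.setLength(0);
    }

    // Convenience wrapper (an assumption of this sketch): parse XML held in a string.
    static List<QueryRange> parse(String xml) {
        HspRangeHandler handler = new HspRangeHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse Blast XML", e);
        }
        return handler.ranges;
    }
}
```

Swapping `QueryRange` for some other object only means changing the handler;
the parser itself is untouched.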

You may also want to do XSLT manipulations of the data - the parser plugs
right into this (thanks Mat and Harold!).

So the point is, yes of course you *need* objects.  But not everyone needs
the *same* objects.  There exists a bunch of objects in biojava designed for
the purpose of holding results from searches, so if you don't have special
needs of your own, then you could probably use these (Keith has used these to
hold the results from FASTA searches).

> Thus far biojava has not provided utilities for xml binding, as it parses
> not from xml but from the raw blast result, and uses the SAX parser as a
> general tool to do the parsing. The object binding part after parsing is
> missing.  I am not sure anyone on the mailing list has really put the
> biojava sax blast parser into real practice. If you have done this, please
> share your experience with me.
>
> In the past, this may have made sense.  But now the ncbi blast utility can
> provide xml format results. I think biojava really should embrace this to do
> the xml binding to the current biojava objects.

There is nothing to stop you from doing this. NCBI Blast provides XML output.
What more is there to do?  What biojava provides is a common XML format that
lets you exploit results from: NCBI Blast, WU-Blast, Fasta (thanks Keith!),
and HMMER (full support coming in a couple of days).

This gives significant potential for code re-use.  For example, code for HTML
visualisation (again coming in a couple of days).  But also other event
handlers.

> Or use the jaxb package to instantiate intermediate objects.  For parsing
> purposes, the intermediate objects may be good enough. I am thinking about
> trying this when I have some spare time to take the jaxb stuff for a spin.
>
> Please comment on this. As far as I am concerned, the bioinformatics result
> I/O parsing, including blast and similar tools, is kind of critical.

JAXB is cool, what can I say.   You can trivially create XML using the biojava
framework - we should have an XML Schema to replace the current DTD soon, and
thus you should be able to use JAXB with biojava.  I'm not sure how JAXB
scales to large datasets; I haven't looked into it.

Simon
--
Simon M. Brocklehurst, Ph.D.
Head of Bioinformatics & Advanced IS
Cambridge Antibody Technology
The Science Park, Melbourn, Cambridgeshire, UK
http://www.CambridgeAntibody.com/
mailto:simon.brocklehurst@CambridgeAntibody.com

