[Biojava-l] new blast outputs XML

Simon Brocklehurst simon.brocklehurst@CambridgeAntibody.com
Fri, 11 Aug 2000 20:03:54 +0100


Gerald Loeffler wrote:

> hi!
>
> The new NCBI-web-blast-release (available since yesterday for download
> only for now) can output XML according to the attached DTD. I expect
> that this feature will be available for the command-line version of
> blast soon. The benefits of this i think are obvious...
>
> We should discuss what this means to our own blast-parsing efforts. I
> will definitely build java support (rather sooner than later) for this
> type of XML output and would contribute it to biojava if there is
> interest...
>
> Another question (to Simon) is, how similar the NCBI DTD is to the DTD
> used by the SAX-type-blast-parser.

Hi!

Firstly, now seems a good time for a discussion of the biojava Blast-like
parsing framework. Secondly, a comparison between the NCBI and biojava DTDs.

The ideas behind / features of the Blast-like parsing framework, which spits
out SAX events a la the biojava:BlastLikeDataSetCollection DTD, are as follows:

o Support for multiple bioinformatics programs e.g. NCBI-Blast, WU-Blast,
HMMER, DBA etc. The version we're putting in on Monday supports NCBI-Blast
and WU-Blast, and has partial support for HMMER (complete support coming soon).

The benefit of using a common XML model for all the output from these programs
is that it significantly reduces the amount of Java, XSLT, HTML you have to
write, maintain etc.  For example, if you can visualize, interact with or
persist the output from one piece of software supported by the framework, you
can do the same for all supported programs in the framework for free.
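
To illustrate the "for free" point, here is a minimal sketch of a single SAX
handler that counts hits without knowing which program produced the data. It
is not code from the framework: the element name "Hit", the "biojava" prefix
and the namespace URI "urn:example:biojava" are assumptions made up for the
example, so check the real DTD before relying on any of them.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

final class HitCounter extends DefaultHandler {
    private int hits;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // "Hit" is an assumed element name, matched on the local (prefix-free) name.
        if ("Hit".equals(localName)) {
            hits++;
        }
    }

    /** Parses a document in the common model and returns the number of hits seen. */
    static int countHits(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // needed so localName is populated
            XMLReader reader = factory.newSAXParser().getXMLReader();
            HitCounter handler = new HitCounter();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
            return handler.hits;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical namespace declaration and documents, for illustration only.
        String ns = " xmlns:biojava='urn:example:biojava'";
        String fromBlast = "<biojava:BlastLikeDataSetCollection" + ns + ">"
            + "<biojava:Hit/><biojava:Hit/></biojava:BlastLikeDataSetCollection>";
        String fromHmmer = "<biojava:BlastLikeDataSetCollection" + ns + ">"
            + "<biojava:Hit/></biojava:BlastLikeDataSetCollection>";
        // The same handler serves both, because both share one XML model.
        System.out.println(countHits(fromBlast));
        System.out.println(countHits(fromHmmer));
    }
}
```

The handler only depends on the shared model, so adding support for another
program to the framework would not require touching it.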

o It is supposed to be relatively stable (when it gets into a biojava
release).

The benefit of this is that when the output format of the native
bioinformatics software changes from version to version, as it tends to do, you
don't need to change any of the code of your Java applications that use
this software.  For example, Java ContentHandlers, value objects, database
persistence, analysis, XSLT/HTML visualization etc., i.e. all your computer
systems just keep on working.

In cases where software produces XML output (which is obviously the ideal from
a robustness point of view), such as the new Blast XML output, life is easier.
From the point of view of maintaining the framework, things are really
simple - it's just a matter of writing an XSLT transform to convert formats
rather than writing a new Java parser.
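
As a rough sketch of what such a conversion might look like when driven from
Java, the example below applies a tiny XSLT stylesheet using the standard
javax.xml.transform API. The input element names (Hit, Hit_id, Hit_hsps, Hsp)
are written from memory of the NCBI DTD and should be checked against it. The
output names (HitList, and Hit with id and hspCount attributes) are purely
hypothetical and are not the biojava DTD.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class NcbiToCommonModel {
    // Illustrative stylesheet: element-only input, attribute-carrying output.
    static final String XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='/'>"
      + "<HitList><xsl:apply-templates select='//Hit'/></HitList>"
      + "</xsl:template>"
      + "<xsl:template match='Hit'>"
      + "<Hit id='{Hit_id}' hspCount='{count(Hit_hsps/Hsp)}'/>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    /** Runs the stylesheet over NCBI-style XML and returns the converted XML. */
    static String convert(String ncbiXml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(ncbiXml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("transform failed", e);
        }
    }

    public static void main(String[] args) {
        // Made-up input fragment in the all-elements style.
        String ncbi = "<BlastOutput><Hit><Hit_id>seq1</Hit_id>"
            + "<Hit_hsps><Hsp/><Hsp/></Hit_hsps></Hit></BlastOutput>";
        System.out.println(convert(ncbi));
    }
}
```

A real stylesheet would be much longer, but the Java side would stay about
this size, which is the attraction compared with writing a new parser.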

o It elegantly models the data.

Benefits are that XML conforming to the biojava DTD can in principle represent
all the semantic meaning we could be interested in (not sure if this is 100%
yet, but it's a good start) in an appropriate way.

For example, it could be important to you to model different views of HSPs
e.g. have plus strand HSPs collected separately from minus strand HSPs. The
NCBI Blast DTD has no concept of multiple, distinct collections of HSPs within
a given hit (at least I don't think it has).
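
In value-object terms, the idea of several distinct HSP collections within
one hit could be sketched as below. This is an illustration only, not a class
from biojava: StrandedHit, Hsp and Strand are hypothetical names, and a real
model would carry far more per-HSP data than a bit score.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

final class StrandedHit {
    enum Strand { PLUS, MINUS }

    /** Minimal stand-in for an HSP; only what the example needs. */
    static final class Hsp {
        final Strand strand;
        final double bitScore;

        Hsp(Strand strand, double bitScore) {
            this.strand = strand;
            this.bitScore = bitScore;
        }
    }

    private final String id;
    // One collection of HSPs per strand, all belonging to the same hit.
    private final Map<Strand, List<Hsp>> collections = new EnumMap<>(Strand.class);

    StrandedHit(String id) {
        this.id = id;
    }

    String id() {
        return id;
    }

    void add(Hsp hsp) {
        collections.computeIfAbsent(hsp.strand, s -> new ArrayList<>()).add(hsp);
    }

    /** Returns a read-only view of the HSPs on one strand (empty if none). */
    List<Hsp> hsps(Strand strand) {
        return Collections.unmodifiableList(
            collections.getOrDefault(strand, Collections.emptyList()));
    }

    public static void main(String[] args) {
        StrandedHit hit = new StrandedHit("seq1");
        hit.add(new Hsp(Strand.PLUS, 52.0));
        hit.add(new Hsp(Strand.MINUS, 40.5));
        hit.add(new Hsp(Strand.PLUS, 33.1));
        System.out.println("plus: " + hit.hsps(Strand.PLUS).size());
        System.out.println("minus: " + hit.hsps(Strand.MINUS).size());
    }
}
```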

o The biojava community has control of the format.

The benefit of this is that we as a community are in control of whether we
want code based on the framework to break if we want to change anything.

Secondly, a quick comparison (similarities and differences) between the
biojava:BlastLikeDataSetCollection DTD and the NCBI Blast DTD.

o Both model the details of the guts of Blast output so it looks like it
should be pretty easy to write an XSLT transform to convert.

o The NCBI DTD more completely models the output of Blast.  We haven't gotten
around to modelling some of the details at the ends of the output from these
programs e.g. statistics etc.

o The biojava DTD obviously can model the output from programs other than
NCBI-Blast.

o XML conforming to the biojava DTD is "compound document ready" i.e. uses
namespaces etc.  DTDs are a bit useless in this regard, but XML isn't, so it's
kinda nice to be ready.  I expect that we will be able to move to schema
sooner rather than later.

o The biojava DTD uses a mixture of elements and attributes to model data,
whereas the NCBI DTD uses all elements and no attributes.  With regard to this
age-old question, I have fairly strong views about appropriate and
inappropriate ways to model data in XML. Perhaps unsurprisingly, these are
reflected in the design choices in the biojava DTD!  In the end, however, this
is simply a matter of opinion.
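
For anyone who hasn't run into the elements-versus-attributes question, the
sketch below reads the same expectation value from both styles using the
standard DOM API. Hsp_evalue is written from memory of the NCBI DTD. The
HSPSummary element and its expectValue attribute are assumptions standing in
for the mixed style and should be checked against the biojava DTD.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

final class ElementVsAttribute {
    /** Parses a small XML string and returns its root element. */
    static Element parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    // All-elements style: the value is the text content of a child element.
    static String fromChildElement(Element parent, String childName) {
        return parent.getElementsByTagName(childName).item(0).getTextContent();
    }

    // Mixed style: the value is an attribute on the element itself.
    static String fromAttribute(Element element, String attributeName) {
        return element.getAttribute(attributeName);
    }

    public static void main(String[] args) {
        Element ncbiStyle = parse("<Hsp><Hsp_evalue>1e-20</Hsp_evalue></Hsp>");
        Element mixedStyle = parse("<HSPSummary expectValue='1e-20'/>");
        System.out.println(fromChildElement(ncbiStyle, "Hsp_evalue"));
        System.out.println(fromAttribute(mixedStyle, "expectValue"));
    }
}
```

Either way the information is the same, which is why converting between the
two with XSLT should be straightforward.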

OK, enough for now.

Simon
--
Simon M. Brocklehurst, Ph.D.
Head of Bioinformatics & Advanced IS
Cambridge Antibody Technology
The Science Park, Melbourn, Cambridgeshire, UK
http://www.CambridgeAntibody.com/
mailto:simon.brocklehurst@CambridgeAntibody.com