[Biojava-l] Massaging multi-query BLAST XML output...

Thu Jul 10 11:42:24 EDT 2003

Thanks David and Mark,

Mark's email: 

>See the JavaDoc comments on
>org.biojava.bio.program.sax.blastxml.BlastOutputHandler for various
>options and a script you could use. Whew that's a long package name :)

Tried to look for this in the JavaDocs on the biojava website, but couldn't find the class (and after all that typing too!).  That's when I noticed that those were the docs for the pre-release of 1.3.  But the class was indeed in the 1.3 official release.

Now on to David's:

>Fast paid-for service and development can hopefully be obtained from
>Biojava Consulting. But as this is a voluntary sideline on my part, I
>suppose I do not feel particularly obliged to jump when told to do so...

No problem.  I realize everyone is busy and that this is purely voluntary, so the shame comment was meant in jest. :)  However, when I take over the world with genetically engineered T-rexes with the help of biojava, then I won't be joking any longer.  Oops, did I reveal my evil plan out loud again?

>This was the script used when this parser was much younger:-
>#!/bin/sh
># Converts a Blast XML output to something vaguely well-formed
># for parsing.
># Use: blast_aggregate <XML output> <editted file>
># strips all <?xml> and <!DOCTYPE> tags
># encapsulates the multiple <BlastOutput> elements into <blast_aggregate>
>sed '/<?xml/d' $1 | sed '/<!DOCTYPE/d' | sed '1i\
><blast_aggregate>
>$a\
></blast_aggregate>' > $2

This mostly does what needs to be done.  The only thing remaining technically is rewrapping the newly aggregated structure with DOCTYPE and ?xml? tags, but that's no big deal.
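
For what it's worth, that remaining step could be folded into the script itself. Below is a minimal sketch, assuming a bare <?xml version="1.0"?> declaration is all that is wanted on top. The function name blast_aggregate_decl is made up for illustration, and no DOCTYPE is written because nothing in this thread names a DTD for <blast_aggregate>.

```sh
#!/bin/sh
# Sketch only: blast_aggregate_decl is a hypothetical name, not part of BioJava.
# Usage: blast_aggregate_decl <XML output>    (result goes to stdout)
blast_aggregate_decl() {
    # One XML declaration for the whole aggregated document (an assumption;
    # the original script emits none).
    printf '%s\n' '<?xml version="1.0"?>'
    printf '%s\n' '<blast_aggregate>'
    # Drop the per-report <?xml ...?> and <!DOCTYPE ...> lines, as the
    # original script does; every other line passes through unchanged.
    sed -e '/<?xml/d' -e '/<!DOCTYPE/d' "$1"
    printf '%s\n' '</blast_aggregate>'
}
```

It would be called as "blast_aggregate_decl multi_query.xml > aggregated.xml" (file names are placeholders). Like the original, it deletes whole lines, so it relies on each declaration and DOCTYPE sitting on a line of its own.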

FYI, the JavaDoc utility cuts off most of that script which is why I couldn't find it in the online documentation (which looks strangely incomplete as a result).  Everything down to the "<XML output> <editted file>" part shows up normally. After that, nothing else gets JavaDoc'd.  Is there a bug in Javadoc that anyone is aware of?  I didn't notice this until I downloaded the source and looked at the BlastXMLParser directly.

>Which I wrote and used with a very early incarnation of this parser.  The
>trouble is I don't even understand it anymore.

No problem.  I personally run in fear when I see a sed script.  If necessary, I'm sure I can find someone who can interpret it at one of the local nursing homes. :)

>There is also another masseur available:-
>http://biojava.org/pipermail/biojava-l/2003-June/003933.html
>but I don't think this one wraps it the same way.  I haven't checked to
>see if the output works just the same though.  It should not be difficult
>to modify it to do the same thing as the other script if that is a good
>thing.

The XMLPreprocessor program in that link turns tag values into attribute values, but doesn't actually do anything about the wrapping.

>I would strongly recommend using the BlastXMLParserFacade class instead of
>the BlastXMLParser class unless setting up StAX parsers appeals to you.
>There is a demo of the use of the latter in the biojava-live repository.

I looked everywhere for BlastXMLParserFacade, but I couldn't find it.  Do you happen to recall which package it sits in? 

Also, I'm not really sure how to go about using the StAX stuff.  For the time being, I'll just create my own data structures using vanilla SAX.

Thanks again!
DeAngelo

-----Original Message-----
From: David Huen [mailto:smh1008 at cus.cam.ac.uk]
Sent: Wednesday, July 09, 2003 5:12 PM
To: DeAngelo Lampkin
Cc: biojava-l at biojava.org
Subject: Re: [Biojava-l] Massaging multi-query BLAST XML output...

On Wed, 9 Jul 2003, DeAngelo Lampkin wrote:

> Hi guys,
> 
> First of all, thanks to Keith and Matthew on the assist with the last question.  And to the rest, shame on you all for not helping sooner! :)

Fast paid-for service and development can hopefully be obtained from
Biojava Consulting. But as this is a voluntary sideline on my part, I
suppose I do not feel particularly obliged to jump when told to do so...

Alternatively, support is >sometimes< available on irc.freenode.net:6667
channel #biojava.  That is, of course, when we are not just messing
around. A fair proportion of the traffic is development-related.

> 
> So now for my newest question concerning parsing Blast XML files; 
> specifically the mangled XML files that come out as a result of a 
> multiple-query FASTA file search.  I read the JavaDoc on
> BlastXMLParser 
> and it made reference to a shell script (blast_aggregate) that massages 
> the XML output into something that is, you know, *legal* XML. Is 
> this an actual script floating around somewhere or was it something 
> put in for illustrative purposes only?  I could do it myself, 
> I suppose, but while I like wheels as much as the next guy, 
> I try to avoid reinventing them when possible.  
> 
This was the script used when this parser was much younger:-
#!/bin/sh
# Converts a Blast XML output to something vaguely well-formed
# for parsing.
# Use: blast_aggregate <XML output> <editted file>

# strips all <?xml> and <!DOCTYPE> tags
# encapsulates the multiple <BlastOutput> elements into <blast_aggregate>

sed '/<?xml/d' $1 | sed '/<!DOCTYPE/d' | sed '1i\
<blast_aggregate>
$a\
</blast_aggregate>' > $2

==================

Which I wrote and used with a very early incarnation of this parser.  The
trouble is I don't even understand it anymore.  I think it just
sequentially strips each element in a hacky way and then prepends and
appends the root element. Since then, the default
SAX parser has changed and DTDs are now used in the parsing so I don't
know whether the output will still work as expected.  I know the single
copy case still works as I did a demo for that one very recently.
(note that the required element is <blast_aggregate>, not
<blast_aggregator>, javadocs have just been updated.)
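
A quick way to see whether a massaged file has the shape described above, before handing it to the parser, is to count the relevant lines. This is only a sketch: check_aggregate is a hypothetical helper, not something shipped with BioJava, and it assumes each tag of interest sits on its own line, as the sed script also does. It does not check well-formedness or validate against any DTD.

```sh
#!/bin/sh
# Sketch only: check_aggregate is a hypothetical helper, not part of BioJava.
# Usage: check_aggregate <aggregated file>
# Prints the number of <BlastOutput> reports; returns non-zero if the file
# does not look like the output of blast_aggregate.
check_aggregate() {
    [ -r "$1" ] || { echo "cannot read $1" >&2; return 1; }
    # grep -c exits non-zero when the count is 0, hence the "|| true".
    opens=$(grep -c '<blast_aggregate>' "$1" || true)
    closes=$(grep -c '</blast_aggregate>' "$1" || true)
    decls=$(grep -c '<?xml' "$1" || true)
    doctypes=$(grep -c '<!DOCTYPE' "$1" || true)
    reports=$(grep -c '<BlastOutput>' "$1" || true)
    if [ "$opens" -ne 1 ] || [ "$closes" -ne 1 ]; then
        echo "expected exactly one <blast_aggregate> wrapper" >&2
        return 1
    fi
    # Unmassaged multi-query output carries one of each per query.
    if [ "$decls" -gt 1 ] || [ "$doctypes" -gt 1 ]; then
        echo "more than one <?xml or <!DOCTYPE line left in the file" >&2
        return 1
    fi
    echo "$reports"
}
```

The printed count should match the number of query sequences in the FASTA file that was searched.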

There was a DTD problem reported by jinchen:-
http://biojava.org/pipermail/biojava-l/2003-June/003947.html

which I have been unable to reproduce.

There is also another masseur available:-
http://biojava.org/pipermail/biojava-l/2003-June/003933.html

but I don't think this one wraps it the same way.  I haven't checked to
see if the output works just the same though.  It should not be difficult
to modify it to do the same thing as the other script if that is a good
thing.

I haven't used this code in anger since Nov last year.  I should point out
that not all parts of the blast output XML are used: not all of it can be
force-fitted to the DTD standard we use for the interface code.  All that
was discussed on the ML last month.

I would strongly recommend using the BlastXMLParserFacade class instead of
the BlastXMLParser class unless setting up StAX parsers appeals to you.
There is a demo of the use of the latter in the biojava-live repository.

D.H.