[Biojava-l] Massaging multi-query BLAST XML output...

David Huen smh1008 at cus.cam.ac.uk
Thu Jul 10 02:12:24 EDT 2003


On Wed, 9 Jul 2003, DeAngelo Lampkin wrote:

> Hi guys,
> 
> First of all, thanks to Keith and Matthew on the assist with the last question.  And to the rest, shame on you all for not helping sooner! :)

Fast paid-for service and development can hopefully be obtained from
Biojava Consulting. But as this is a voluntary sideline on my part, I
suppose I do not feel particularly obliged to jump when told to do so...

Alternatively, support is >sometimes< available on irc.freenode.net:6667
channel #biojava.  That is, of course, when we are not just messing
around. A fair proportion of the traffic is development-related.

> 
> So now for my newest question concerning parsing Blast XML files; 
> specifically the mangled XML file that comes out as a result of a 
> multiple query FASTA file search.   I read the JavaDoc on
> BlastXMLParser 
> and it made reference to a shell script (blast_aggregate) that massages 
> the XML output into something that is, you know, *legal* XML. Is 
> this an actual script floating around somewhere or was it something 
> put in for illustrative purposes only?  I could do it myself, 
> I suppose, but while I like wheels as much as the next guy, 
> I try to avoid reinventing them when possible.  
> 
This was the script used when this parser was much younger:-
#!/bin/sh
# Converts a Blast XML output to something vaguely well-formed
# for parsing.
# Use: blast_aggregate <XML output> <edited file>

# strips all <?xml> and <!DOCTYPE> tags
# encapsulates the multiple <BlastOutput> elements into <blast_aggregate>


sed '/<?xml/d' "$1" | sed '/<!DOCTYPE/d' | sed '1i\
<blast_aggregate>
$a\
</blast_aggregate>' > "$2"

==================

Which I wrote and used with a very early incarnation of this parser.  The
trouble is I don't even understand it anymore.  I think it just strips
every <?xml and <!DOCTYPE line in a hacky way and then prepends and
appends the root element.  Since then, the default SAX parser has changed
and DTDs are now used in the parsing, so I don't know whether the output
will still work as expected.  I know the single-copy case still works as I
did a demo for that one very recently.  (Note that the required element is
<blast_aggregate>, not <blast_aggregator>; the javadocs have just been
updated.)
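
For what it is worth, the same massage can be sketched without sed's
insert/append commands.  This is only a sketch and has not been run against
real BLAST output or the current parser: the function form and the
two-query sample file are made up for illustration, and it assumes that
every <?xml and <!DOCTYPE declaration sits on a line of its own.

```shell
#!/bin/sh
# Sketch only: wrap concatenated BLAST XML documents in <blast_aggregate>.
# Assumes each <?xml ...?> and <!DOCTYPE ...> declaration is on its own line.

blast_aggregate() {
    # $1: path to the concatenated multi-query BLAST XML output
    printf '%s\n' '<blast_aggregate>'
    # drop every line carrying an XML or DOCTYPE declaration
    sed -e '/<?xml/d' -e '/<!DOCTYPE/d' "$1"
    printf '%s\n' '</blast_aggregate>'
}

# Made-up two-query input, trimmed down to the bare structure.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<?xml version="1.0"?>
<!DOCTYPE BlastOutput>
<BlastOutput>
</BlastOutput>
<?xml version="1.0"?>
<!DOCTYPE BlastOutput>
<BlastOutput>
</BlastOutput>
EOF

result=$(blast_aggregate "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```

Unlike the original script, this version writes the wrapper element even
when the input is empty, and it prints to standard output rather than to a
second file argument.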


There was a DTD problem reported by jinchen:-
http://biojava.org/pipermail/biojava-l/2003-June/003947.html

which I have been unable to reproduce.

There is also another masseur available:-
http://biojava.org/pipermail/biojava-l/2003-June/003933.html

but I don't think this one wraps it the same way.  I haven't checked to
see if the output works just the same though.  It should not be difficult
to modify it to do the same thing as the other script if that is a good
thing.

I haven't used this code in anger since Nov last year.  I should point out
that not all parts of the blast output XML are used: not all of it can be
force-fitted to the DTD standard we use for the interface code.  All that
was discussed on the ML last month.

I would strongly recommend using the BlastXMLParserFacade class instead of
the BlastXMLParser class unless setting up StAX parsers appeals to you.
There is a demo of the use of the latter in the biojava-live repository.

D.H.



