[Biojava-l] Massaging multi-query BLAST XML output...

Thu Jul 10 10:40:09 EDT 2003

Is there some individual responsible for these bioinf apps emitting 
badly-formed XML that we can shoot very publicly at a conference as a 
warning to all others?

(grins)

Matthew

David Huen wrote:
> On Wed, 9 Jul 2003, DeAngelo Lampkin wrote:
> 
> 
>>Hi guys,
>>
>>First of all, thanks to Keith and Matthew on the assist with the last question.  And to the rest, shame on you all for not helping sooner! :)
> 
> 
> Fast paid-for service and development can hopefully be obtained from
> Biojava Consulting. But as this is a voluntary sideline on my part, I
> suppose I do not feel particularly obliged to jump when told to do so...
> 
> Alternatively, support is >sometimes< available on irc.freenode.net:6667
> channel #biojava.  That is, of course, when we are not just messing
> around. A fair proportion of the traffic is development-related.
> 
> 
>>So now for my newest question concerning parsing Blast XML files; 
>>specifically the mangled XML file that comes out as a result of a 
>>multiple query FASTA file search.   I read the JavaDoc on
>>BlastXMLParser 
>>and it made reference to a shell script (blast_aggregate) that massages 
>>the XML output into something that is, you know, *legal* XML. Is 
>>this an actual script floating around somewhere or was it something 
>>put in for illustrative purposes only?  I could do it myself, 
>>I suppose, but while I like wheels as much as the next guy, 
>>I try to avoid reinventing them when possible.  
>>
> 
> This was the script used when this parser was much younger:-
> #!/bin/sh
> # Converts a Blast XML output to something vaguely well-formed
> # for parsing.
> # Use: blast_aggregate <XML output> <edited file>
> 
> # strips every line containing an <?xml> or <!DOCTYPE> declaration
> # encapsulates the multiple <BlastOutput> elements into <blast_aggregate>
> 
> 
> sed '/<?xml/d' "$1" | sed '/<!DOCTYPE/d' | sed '1i\
> <blast_aggregate>
> $a\
> </blast_aggregate>' > "$2"
> 
> ==================
> 
> Which I wrote and used with a very early incarnation of this parser.  The
> trouble is I don't even understand it anymore.  I think it just
> strips each <?xml> and <!DOCTYPE> line in turn in a hacky way and then
> prepends and appends the root element. Since then, the default
> SAX parser has changed and DTDs are now used in the parsing so I don't
> know whether the output will still work as expected.  I know the single
> copy case still works as I did a demo for that one very recently.
> (note that the required element is <blast_aggregate>, not
> <blast_aggregator>, javadocs have just been updated.)
> 
> 
> There was a DTD problem reported by jinchen:-
> http://biojava.org/pipermail/biojava-l/2003-June/003947.html
> 
> which I have been unable to reproduce.
> 
> There is also another masseur available:-
> http://biojava.org/pipermail/biojava-l/2003-June/003933.html
> 
> but I don't think this one wraps it the same way.  I haven't checked to
> see if the output works just the same though.  It should not be difficult
> to modify it to do the same thing as the other script if that is a good
> thing.
> 
> I haven't used this code in anger since Nov last year.  I should point out
> that not all parts of the blast output XML are used: not all of it can be
> force-fitted to the DTD standard we use for the interface code.  All that
> was discussed on the ML last month.
> 
> I would strongly recommend using the BlastXMLParserFacade class instead of
> the BlastXMLParser class unless setting up StAX parsers appeals to you.
> There is a demo of the use of the latter in the biojava-live repository.
> 
> D.H.
> 
> 
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at biojava.org
> http://biojava.org/mailman/listinfo/biojava-l
> 
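
For anyone finding this thread later, the wrapping step that the quoted
script performs can be sketched as below.  This is a minimal sketch, not
the script referred to in the JavaDoc: the function name
blast_aggregate_sketch is made up for illustration, and it assumes every
<?xml ...?> and <!DOCTYPE ...> declaration sits on a line of its own, as
the original sed patterns also assume.  It writes the root tags with
printf rather than sed's i\ and a\ commands, so the wrapper is still
emitted when the input is empty.

```shell
#!/bin/sh
# Sketch only: wrap concatenated BLAST XML reports in one root element.
# blast_aggregate_sketch is a hypothetical name, not part of BioJava.
blast_aggregate_sketch() {
    # $1 = multi-query BLAST XML output, $2 = file to write
    if [ "$#" -ne 2 ]; then
        echo "usage: blast_aggregate_sketch <XML output> <edited file>" >&2
        return 2
    fi
    {
        # Opening tag of the single root element.
        printf '%s\n' '<blast_aggregate>'
        # Drop every line carrying an XML declaration or a DOCTYPE.
        # Assumption: each declaration occupies a line of its own.
        sed -e '/<?xml/d' -e '/<!DOCTYPE/d' "$1"
        # Closing tag of the root element.
        printf '%s\n' '</blast_aggregate>'
    } > "$2"
}
```

Calling blast_aggregate_sketch multi.xml wrapped.xml (both file names are
placeholders) on a file holding two concatenated reports should leave
wrapped.xml with one <blast_aggregate> root around both <BlastOutput>
elements.  Whether the current DTD-aware parser accepts that file is not
tested here.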

-- 
BioJava Consulting LTD - Support and training for BioJava
http://www.biojava.co.uk