[Biojava-l] Massaging multi-query BLAST XML output...
Matthew Pocock
matthew_pocock at yahoo.co.uk
Thu Jul 10 10:40:09 EDT 2003
Is there some individual responsible for these bioinf apps emitting
badly-formed XML that we can shoot very publicly at a conference as a
warning to all others?
(grins)
Matthew
David Huen wrote:
> On Wed, 9 Jul 2003, DeAngelo Lampkin wrote:
>
>
>>Hi guys,
>>
>>First of all, thanks to Keith and Matthew on the assist with the last question. And to the rest, shame on you all for not helping sooner! :)
>
>
> Fast paid-for service and development can hopefully be obtained from
> Biojava Consulting. But as this is a voluntary sideline on my part, I
> suppose I do not feel particularly obliged to jump when told to do so...
>
> Alternatively, support is >sometimes< available on irc.freenode.net:6667
> channel #biojava. That is, of course, when we are not just messing
> around. A fair proportion of the traffic is development-related.
>
>
>>So now for my newest question concerning parsing Blast XML files:
>>specifically the mangled XML file that comes out as a result of a
>>multiple query FASTA file search. I read the JavaDoc on
>>BlastXMLParser
>>and it made reference to a shell script (blast_aggregate) that massages
>>the XML output into something that is, you know, *legal* XML. Is
>>this an actual script floating around somewhere or was it something
>>put in for illustrative purposes only? I could do it myself,
>>I suppose, but while I like wheels as much as the next guy,
>>I try to avoid reinventing them when possible.
>>
>
> This was the script used when this parser was much younger:-
> #!/bin/sh
> # Converts a Blast XML output to something vaguely well-formed
> # for parsing.
> # Use: blast_aggregate <XML output> <edited file>
>
> # strips all <?xml> and <!DOCTYPE> tags
> # encapsulates the multiple <BlastOutput> elements into <blast_aggregate>
>
>
> sed '/<?xml/d' "$1" | sed '/<!DOCTYPE/d' | sed '1i\
> <blast_aggregate>
> $a\
> </blast_aggregate>' > "$2"
>
> ==================
>
> Which I wrote and used with a very early incarnation of this parser. The
> trouble is I don't even understand it anymore. I think it just
> sequentially strips each <?xml> and <!DOCTYPE> line in a hacky way and
> then prepends and appends the root element. Since then, the default
> SAX parser has changed and DTDs are now used in the parsing, so I don't
> know whether the output will still work as expected. I know the single
> copy case still works as I did a demo for that one very recently.
> (Note that the required element is <blast_aggregate>, not
> <blast_aggregator>; the javadocs have just been updated.)
>
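For anyone trying to follow what the script above does, here is a minimal sketch of the same idea written as a shell function. It is not the original script: the function name blast_aggregate simply mirrors the script name, and printf is swapped in for sed's i\ and a\ commands to write the root element. It assumes, as the original does, that every <?xml ...?> and <!DOCTYPE ...> declaration sits on a line of its own. Whether the current parser accepts the result is not something this sketch checks.

```shell
#!/bin/sh
# Sketch only: wrap the concatenated reports from a multi-query BLAST XML
# run in a single <blast_aggregate> root element.
# Assumption: each <?xml ...?> and <!DOCTYPE ...> declaration occupies
# exactly one line of the input.
blast_aggregate() {
    # $1 = raw multi-query XML output, $2 = file to write
    {
        # Open the wrapping root element.
        printf '%s\n' '<blast_aggregate>'
        # Delete every XML declaration line and every DOCTYPE line.
        sed -e '/<?xml/d' -e '/<!DOCTYPE/d' "$1"
        # Close the wrapping root element.
        printf '%s\n' '</blast_aggregate>'
    } > "$2"
}
```

It would be called as blast_aggregate raw.xml wrapped.xml (both file names are placeholders), with wrapped.xml then handed to the parser.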
>
> There was a DTD problem reported by jinchen:-
> http://biojava.org/pipermail/biojava-l/2003-June/003947.html
>
> which I have been unable to reproduce.
>
> There is also another masseur available:-
> http://biojava.org/pipermail/biojava-l/2003-June/003933.html
>
> but I don't think this one wraps it the same way. I haven't checked to
> see if the output works just the same though. It should not be difficult
> to modify it to do the same thing as the other script if that is a good
> thing.
>
> I haven't used this code in anger since Nov last year. I should point out
> that not all parts of the blast output XML are used: not all of it can be
> force-fitted to the DTD standard we use for the interface code. All that
> was discussed on the ML last month.
>
> I would strongly recommend using the BlastXMLParserFacade class instead of
> the BlastXMLParser class unless setting up StAX parsers appeals to you.
> There is a demo of the use of the latter in the biojava-live repository.
>
> D.H.
>
>
> _______________________________________________
> Biojava-l mailing list - Biojava-l at biojava.org
> http://biojava.org/mailman/listinfo/biojava-l
>
--
BioJava Consulting LTD - Support and training for BioJava
http://www.biojava.co.uk