[BioRuby] Parsing large blastout.xml files

Adam adamnkraut at gmail.com
Fri Nov 6 03:17:02 UTC 2009


You might want to try a SAX Parser instead.

REXML from the standard library has a streaming API.  LibXML is a lot faster
and it's available as a gem.

http://libxml.rubyforge.org/
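
A rough sketch with REXML's stream API (untested against real BLAST output;
the element names are just taken from your example below, and HitCounter is a
made-up listener for illustration):

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Streams the document instead of building a full tree, so memory use stays
# flat no matter how big the file is. Element names follow the NCBI BLAST
# XML structure quoted below; adjust to whatever fields you actually need.
class HitCounter
  include REXML::StreamListener
  attr_reader :counts  # hits seen in each <Iteration>, in document order

  def initialize
    @counts = []
  end

  def tag_start(name, _attrs)
    @counts << 0 if name == 'Iteration'
    @counts[-1] += 1 if name == 'Hit' && !@counts.empty?
  end
end

listener = HitCounter.new
# In practice you'd pass an IO, e.g. File.open("blastout.xml"); a string
# works the same way for a quick check:
xml = <<~XML
  <BlastOutput>
    <BlastOutput_iterations>
      <Iteration><Iteration_hits><Hit/><Hit/><Hit/></Iteration_hits></Iteration>
      <Iteration><Iteration_hits><Hit/><Hit/></Iteration_hits></Iteration>
    </BlastOutput_iterations>
  </BlastOutput>
XML
REXML::Document.parse_stream(xml, listener)
listener.counts  # => [3, 2]
```

LibXML's SaxParser has the same event-callback shape (on_start_element etc.)
and should be much faster on a 2.5GB file.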

On Thu, Nov 5, 2009 at 9:55 PM, Rob Syme <rob.syme at gmail.com> wrote:

> I'm trying to extract information from a large blast xml file. To parse the
> xml file, ruby reads the whole file into memory before looking at each
> entry. For large files (around 2.5GB), the memory requirements become severe.
>
> My first approach was to split each query up into its own <BlastOutput> xml
> instance, so that
>
> <BlastOutput>
>  <BlastOutput_iterations>
>    <Iteration>
>      <Iteration_hits>
>        <Hit></Hit>
>        <Hit></Hit>
>        <Hit></Hit>
>      </Iteration_hits>
>    </Iteration>
>    <Iteration>
>      <Iteration_hits>
>        <Hit></Hit>
>        <Hit></Hit>
>        <Hit></Hit>
>      </Iteration_hits>
>    </Iteration>
>    <Iteration>
>      <Iteration_hits>
>        <Hit></Hit>
>        <Hit></Hit>
>        <Hit></Hit>
>      </Iteration_hits>
>    </Iteration>
>  </BlastOutput_iterations>
> </BlastOutput>
>
> Would end up looking more like:
> <BlastOutput>
>  <BlastOutput_iterations>
>    <Iteration>
>      <Iteration_hits>
>        <Hit></Hit>
>        <Hit></Hit>
>        <Hit></Hit>
>      </Iteration_hits>
>    </Iteration>
>  </BlastOutput_iterations>
> </BlastOutput>
>
> <BlastOutput>
>  <BlastOutput_iterations>
>    <Iteration>
>      <Iteration_hits>
>        <Hit></Hit>
>        <Hit></Hit>
>        <Hit></Hit>
>      </Iteration_hits>
>    </Iteration>
>  </BlastOutput_iterations>
> </BlastOutput>
>
> <BlastOutput>
>  <BlastOutput_iterations>
>    <Iteration>
>      <Iteration_hits>
>        <Hit></Hit>
>        <Hit></Hit>
>        <Hit></Hit>
>      </Iteration_hits>
>    </Iteration>
>  </BlastOutput_iterations>
> </BlastOutput>
>
> Which bioruby has trouble parsing, so the <BlastOutput>s had to be given
> their own file:
>
> $ csplit --prefix="tinyxml_" segmented_blastout.xml '/<\?xml/' '{*}'
>
> Now each file can be parsed individually. I feel like there has to be an
> easier way. Is there a way to parse large xml files without huge memory
> overheads, or is that just par for the course?
> _______________________________________________
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby
>


