[BioRuby] Parsing large blastout.xml files

Diana Jaunzeikare rozziite at gmail.com
Fri Nov 6 03:11:32 UTC 2009


Another option is to use the ruby-libxml reader
(http://libxml.rubyforge.org/rdoc/index.html). It reads the data
sequentially, so there is no memory overhead from first reading the
whole file into memory. However, you would then have to do the parsing
from scratch yourself.
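
For example, here is a rough, untested sketch (not BioRuby code) of how
the libxml-ruby pull parser could walk a big report one <Iteration> at a
time, so only a single query's subtree is ever held in memory; what you
do with each fragment is left as a placeholder:

  require 'libxml'
  include LibXML

  reader = XML::Reader.file('blastout.xml')
  while reader.read
    next unless reader.node_type == XML::Reader::TYPE_ELEMENT
    next unless reader.name == 'Iteration'
    # read_outer_xml serializes just the current <Iteration> subtree,
    # i.e. one query's worth of hits, without loading the whole file.
    iteration_xml = reader.read_outer_xml
    # ... handle iteration_xml here, e.g. wrap it in a minimal
    # <BlastOutput> skeleton and parse only that small document ...
  end
  reader.close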

On that note, maybe it is worth implementing Bio::Blast::Report.libxml
or something like that, in the same way there are Bio::Blast::Report.rexml
and Bio::Blast::Report.xmlparser. A dependency on ruby-libxml was already
introduced into the BioRuby library with the PhyloXML parser.
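
As a purely hypothetical sketch, the wrapper itself could be as thin as
the existing shortcuts, assuming Report.new accepts a backend hint the
same way it does for :rexml and :xmlparser (the :libxml backend itself
would of course still need to be written):

  require 'bio'

  module Bio
    class Blast
      class Report
        def self.libxml(data)
          self.new(data, :libxml)  # :libxml backend does not exist yet
        end
      end
    end
  end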

Diana

On Thu, Nov 5, 2009 at 9:55 PM, Rob Syme <rob.syme at gmail.com> wrote:
> I'm trying to extract information from a large BLAST XML file. To parse the
> XML file, Ruby reads the whole file into memory before looking at each
> entry. For large files (2.5 GB-ish), the memory requirements become severe.
>
> My first approach was to split each query up into its own <BlastOutput> XML
> instance, so that
>
> <BlastOutput>
>  <BlastOutput_iterations>
>    <Iteration>
>      <Iteration_hits>
>        <Hit></Hit>
>        <Hit></Hit>
>        <Hit></Hit>
>      </Iteration_hits>
>    </Iteration>
>    <Iteration>
>      <Iteration_hits>
>        <Hit></Hit>
>        <Hit></Hit>
>        <Hit></Hit>
>      </Iteration_hits>
>    </Iteration>
>    <Iteration>
>      <Iteration_hits>
>        <Hit></Hit>
>        <Hit></Hit>
>        <Hit></Hit>
>      </Iteration_hits>
>    </Iteration>
>  </BlastOutput_iterations>
> </BlastOutput>
>
> would end up looking more like:
> <BlastOutput>
>  <BlastOutput_iterations>
>    <Iteration>
>      <Iteration_hits>
>        <Hit></Hit>
>        <Hit></Hit>
>        <Hit></Hit>
>      </Iteration_hits>
>    </Iteration>
>  </BlastOutput_iterations>
> </BlastOutput>
>
> <BlastOutput>
>  <BlastOutput_iterations>
>    <Iteration>
>      <Iteration_hits>
>        <Hit></Hit>
>        <Hit></Hit>
>        <Hit></Hit>
>      </Iteration_hits>
>    </Iteration>
>  </BlastOutput_iterations>
> </BlastOutput>
>
> <BlastOutput>
>  <BlastOutput_iterations>
>    <Iteration>
>      <Iteration_hits>
>        <Hit></Hit>
>        <Hit></Hit>
>        <Hit></Hit>
>      </Iteration_hits>
>    </Iteration>
>  </BlastOutput_iterations>
> </BlastOutput>
>
> BioRuby has trouble parsing this, so each <BlastOutput> had to be given
> its own file:
>
> $ csplit --prefix="tinyxml_" segmented_blastout.xml '/<\?xml/' '{*}'
>
> Now each file can be parsed individually. I feel like there has to be an
> easier way. Is there a way to parse large XML files without huge memory
> overheads, or is that just par for the course?