[BioRuby] Parsing large blastout.xml files

Pjotr Prins pjotr.public14 at thebird.nl
Fri Nov 6 08:58:15 UTC 2009


Diana is right. We need to revamp the implementation to handle big
result sets. Not only that, the current implementation has method names
that do not match the BLAST names. I need something like this pretty
soon and was thinking of writing it.
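
As a starting point, here is a minimal, untested sketch of what a
streaming parser along the lines Diana suggests below might look like.
It assumes a recent ruby-libxml (LibXML::XML::Reader) and the standard
NCBI BLAST XML element names (Iteration_query-def, Hit_def,
Hsp_evalue), and only pulls a couple of fields per hit:

  require 'libxml'    # ruby-libxml gem

  reader = LibXML::XML::Reader.file('blastout.xml')
  query = hit_def = nil

  while reader.read
    # react only to element start nodes
    next unless reader.node_type == LibXML::XML::Reader::TYPE_ELEMENT
    case reader.name
    when 'Iteration_query-def'
      query = reader.read_string        # current query definition
    when 'Hit_def'
      hit_def = reader.read_string      # current hit description
    when 'Hsp_evalue'
      puts [query, hit_def, reader.read_string].join("\t")
    end
  end
  reader.close

Because the reader only keeps the current node in memory, the footprint
stays flat regardless of file size, which is what a hypothetical
Bio::Blast::Report.libxml variant would need as well.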

Pj.

On Thu, Nov 05, 2009 at 10:11:32PM -0500, Diana Jaunzeikare wrote:
> Another option is to use the ruby-libxml reader:
> http://libxml.rubyforge.org/rdoc/index.html  It reads the data
> sequentially, so there is no memory overhead from first reading it all
> into memory. However, you would then have to parse it from scratch.
> 
> On that note, maybe it is worth implementing Bio::Blast::Report.libxml
> or something like that, in the same way there are Bio::Blast::Report.rexml
> and Bio::Blast::Report.xmlparser. A dependency on ruby-libxml was
> already introduced into the BioRuby library by the PhyloXML parser.
> 
> Diana
> 
> On Thu, Nov 5, 2009 at 9:55 PM, Rob Syme <rob.syme at gmail.com> wrote:
> > I'm trying to extract information from a large BLAST XML file. To parse the
> > XML file, Ruby reads the whole file into memory before looking at each
> > entry. For large files (2.5 GB-ish), the memory requirements become severe.
> >
> > My first approach was to split each query up into its own <BlastOutput> XML
> > instance, so that
> >
> > <BlastOutput>
> >  <BlastOutput_iterations>
> >    <Iteration>
> >      <Iteration_hits>
> >        <Hit></Hit>
> >        <Hit></Hit>
> >        <Hit></Hit>
> >      </Iteration_hits>
> >    </Iteration>
> >    <Iteration>
> >      <Iteration_hits>
> >        <Hit></Hit>
> >        <Hit></Hit>
> >        <Hit></Hit>
> >      </Iteration_hits>
> >    </Iteration>
> >    <Iteration>
> >      <Iteration_hits>
> >        <Hit></Hit>
> >        <Hit></Hit>
> >        <Hit></Hit>
> >      </Iteration_hits>
> >    </Iteration>
> >  </BlastOutput_iterations>
> > </BlastOutput>
> >
> > Would end up looking more like:
> > <BlastOutput>
> >  <BlastOutput_iterations>
> >    <Iteration>
> >      <Iteration_hits>
> >        <Hit></Hit>
> >        <Hit></Hit>
> >        <Hit></Hit>
> >      </Iteration_hits>
> >    </Iteration>
> >  </BlastOutput_iterations>
> > </BlastOutput>
> >
> > <BlastOutput>
> >  <BlastOutput_iterations>
> >    <Iteration>
> >      <Iteration_hits>
> >        <Hit></Hit>
> >        <Hit></Hit>
> >        <Hit></Hit>
> >      </Iteration_hits>
> >    </Iteration>
> >  </BlastOutput_iterations>
> > </BlastOutput>
> >
> > <BlastOutput>
> >  <BlastOutput_iterations>
> >    <Iteration>
> >      <Iteration_hits>
> >        <Hit></Hit>
> >        <Hit></Hit>
> >        <Hit></Hit>
> >      </Iteration_hits>
> >    </Iteration>
> >  </BlastOutput_iterations>
> > </BlastOutput>
> >
> > BioRuby has trouble parsing that, so each <BlastOutput> had to be given
> > its own file:
> >
> > $ csplit --prefix="tinyxml_" segmented_blastout.xml '/<\?xml/' '{*}'
> >
> > Now each file can be parsed individually. I feel like there has to be an
> > easier way. Is there a way to parse large XML files without huge memory
> > overheads, or is that just par for the course?
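
If the split-file route is what ends up being used, each piece can be
fed to the existing parser. A rough, untested sketch, assuming the
tinyxml_* files produced by the csplit call above and BioRuby's
Bio::Blast::Report:

  require 'bio'

  Dir.glob('tinyxml_*').sort.each do |path|
    report = Bio::Blast::Report.new(File.read(path))
    report.iterations.each do |iteration|
      iteration.hits.each do |hit|
        # definition and evalue correspond to the Hit_def and Hsp_evalue fields
        puts [iteration.query_def, hit.definition, hit.evalue].join("\t")
      end
    end
  end

Memory use then stays bounded by the largest single <BlastOutput>, at
the cost of the extra csplit step.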


