[BioRuby] Blast parsing speed
Moses M. Hohman
mmhohman at northwestern.edu
Wed Sep 27 06:18:45 UTC 2006
Sounds like bioruby is reading the entire DOM tree of the blast
output XML into memory (with all the paging, etc.). That looks like
what's happening in bio/appl/blast/rexml.rb. It looks like if you
have the xmlparser library installed (http://raa.ruby-lang.org/
project/xmlparser/), which is a SAX parser, it will use that, and
that should solve your problem.
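
To make the difference concrete, here is a rough sketch of event-driven (SAX-style) parsing. It uses REXML's built-in stream parser as a stand-in for xmlparser, so it is not what BioRuby does internally. The HitCounter class, the count_hits helper, and the toy XML are all made up for illustration; only the element name Hit is borrowed from BLAST's XML output.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'
require 'stringio'

# Hypothetical listener (not part of BioRuby): counts <Hit> elements
# as the parser reports them, without ever building a document tree.
class HitCounter
  include REXML::StreamListener

  attr_reader :hits

  def initialize
    @hits = 0
  end

  # Called by the parser once for every opening tag it encounters.
  def tag_start(name, _attrs)
    @hits += 1 if name == 'Hit'
  end
end

# Hypothetical helper: streams the XML from an IO-like object and
# returns the number of <Hit> elements seen.
def count_hits(io)
  listener = HitCounter.new
  REXML::Document.parse_stream(io, listener)
  listener.hits
end

# Toy input, far simpler than real BLAST XML output.
STREAM_SAMPLE_XML = <<XML
<BlastOutput>
  <Hit><Hit_def>first hit</Hit_def></Hit>
  <Hit><Hit_def>second hit</Hit_def></Hit>
</BlastOutput>
XML

puts count_hits(StringIO.new(STREAM_SAMPLE_XML))
```

With a real result file you would pass File.open(path) instead of the StringIO, so the parser can read the file incrementally rather than from a string held in memory.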
We might want to look into using a pull parser instead of a DOM
parser, i.e. in Ruby use rexml/parsers/pullparser instead of the
rexml/document. Pull parsers are nice because they are as memory-
efficient as SAX parsers but allow you to use a more familiar
procedural programming style rather than an event-driven style (like SAX).
So, it's less an issue of the programming language, and more of the
type of XML parser.
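
For what it's worth, a pull-parser loop might look roughly like the sketch below. It uses rexml/parsers/pullparser as mentioned above; the each_hit_def helper and the toy XML are hypothetical and only meant to show the procedural style (only the element name Hit_def is borrowed from BLAST's XML output). This is not BioRuby code.

```ruby
require 'rexml/parsers/pullparser'
require 'stringio'

# Hypothetical helper (not part of BioRuby): yields the text of each
# <Hit_def> element, one at a time, while reading the XML as a stream.
def each_hit_def(io)
  parser = REXML::Parsers::PullParser.new(io)
  in_hit_def = false
  buffer = String.new

  # We ask the parser for the next event ourselves, instead of
  # registering callbacks as in the event-driven style.
  while parser.has_next?
    event = parser.pull
    if event.start_element? && event[0] == 'Hit_def'
      in_hit_def = true
      buffer = String.new
    elsif event.text? && in_hit_def
      buffer << event[0] # raw text of the node (entities not decoded)
    elsif event.end_element? && event[0] == 'Hit_def'
      in_hit_def = false
      yield buffer
    end
  end
end

# Toy input, far simpler than real BLAST XML output.
PULL_SAMPLE_XML = <<XML
<BlastOutput>
  <Hit><Hit_def>first hit</Hit_def></Hit>
  <Hit><Hit_def>second hit</Hit_def></Hit>
</BlastOutput>
XML

each_hit_def(StringIO.new(PULL_SAMPLE_XML)) { |definition| puts definition }
```

Only the current element's text is held at any point, so memory use should not grow with the number of hits in the file.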
Hope that helps, it's a guess but I think it's probably what you're
On Sep 24, 2006, at 6:28 AM, Yannick Wurm wrote:
> I have been happily using bioruby for the past year or so for my post-
> blast analyses. Occasionally, I will have ~ 1gb blast result files
> that need to be parsed. Here my machine may start paging and slows to
> a crawl.
> Thus I wonder:
> - has anyone benchmarked bioruby, bioperl, biojava, biopython when
> processing the same file to compare speed and memory usage?
> - For the sake of future compatibility, I have been using blast's xml
> output. How much slower is it to parse such an xml file relative
> to a "normal" or tabular blast output?
> yannick . wurm @ unil . ch
> Ant Genomics, Ecology & Evolution @ Lausanne