[BioRuby] Blast parsing speed

Wed Sep 27 06:18:45 UTC 2006

Hi Yannick,

It sounds like BioRuby is reading the entire DOM tree of the BLAST
output XML into memory, which would explain the paging. That looks like
what's happening in bio/appl/blast/rexml.rb. It looks like if you
have the xmlparser library installed
(http://raa.ruby-lang.org/project/xmlparser/), which is a SAX parser,
BioRuby will use that instead, and that should solve your problem.

We might want to look into using a pull parser instead of a DOM
parser, i.e. in Ruby use rexml/parsers/pullparser instead of
rexml/document. Pull parsers are nice because they are as
memory-efficient as SAX parsers but allow you to use a more familiar
procedural programming style rather than an event-driven style (like
in xmlparser).
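
To make the pull-parser idea concrete, here is a rough sketch using
REXML's pull parser. It is not what BioRuby does today; the helper name
each_blast_hit is made up for illustration, and it assumes the usual
<Hit>, <Hit_id> and <Hit_len> element names from BLAST's XML output.
Only one hit is held in memory at a time:

```ruby
require 'rexml/parsers/pullparser'

# Hypothetical helper (not part of BioRuby): walks BLAST XML with a pull
# parser and yields one small Hash per <Hit>, so memory use does not grow
# with the size of the file. Assumes <Hit>, <Hit_id>, <Hit_len> elements.
def each_blast_hit(source)
  parser  = REXML::Parsers::PullParser.new(source)
  wanted  = %w[Hit_id Hit_len]
  hit     = nil   # Hash for the hit currently being read, if any
  current = nil   # name of the element whose text we are inside

  while parser.has_next?
    event = parser.pull
    if event.start_element?
      current = event[0]
      hit = {} if current == 'Hit'
    elsif event.text?
      # Text may arrive in several chunks, so append rather than assign.
      if hit && wanted.include?(current)
        hit[current] = (hit[current] || '') + event[0]
      end
    elsif event.end_element?
      if event[0] == 'Hit'
        yield hit
        hit = nil
      end
      current = nil
    end
  end
end
```

You would call it with an open file, e.g.
File.open('result.xml') { |f| each_blast_hit(f) { |h| puts h['Hit_id'] } }
(the file name is just a placeholder). The loop reads like ordinary
procedural code, which is the main appeal over writing SAX callbacks.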

So, it's less an issue of the programming language, and more of the  
type of XML parser.

Hope that helps, it's a guess but I think it's probably what you're  
encountering,

Moses

On Sep 24, 2006, at 6:28 AM, Yannick Wurm wrote:

> Hi,
> I have been happily using bioruby for the past year or so for my post-
> blast analyses. Occasionally, I will have ~ 1gb blast result files
> that need to be parsed. Here my machine may start paging and slows to
> a crawl.
>
> Thus I wonder:
> 	- has anyone benchmarked bioruby, bioperl, biojava, biopython when
> processing the same file to compare speed and memory usage?
> 	- For the sake of future compatibility, I have been using blast's xml
> output. How much slower is it to parse such an xml file relative
> to a "normal" or tabular blast output?
>
> Cheers,
>
> Yannick
>
> --------------------------------------------
>           yannick . wurm @ unil . ch
> Ant Genomics, Ecology & Evolution @ Lausanne
>    http://www.unil.ch/dee/page28685_fr.html