[BioRuby] Plugins, Biogem and Christmas 2010

Pjotr Prins pjotr.public14 at thebird.nl
Mon Feb 14 14:46:39 UTC 2011


Yet another BioRuby plugin.

I just released a fast BLAST XML file parser for big data (i.e. it
does not necessarily load everything into memory). It is based on
Nokogiri+libxml2.  A quick test shows it is 50x faster than the REXML
parser that comes with BioRuby.

Install with

  gem install bio-blastxmlparser

It comes with a utility to produce tabular output

  blastxmlparser --help

Docs at

  https://github.com/pjotrp/blastxmlparser

(you may need to install libxml2-dev first, to build the native
extension).

There is a choice of two parsers: one loads the whole DOM in memory,
the other splits the XML file into smaller sections.
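
To illustrate the splitting idea, here is a minimal sketch in plain
Ruby. It is not the code used in the gem: each_iteration_chunk is a
hypothetical helper, and it assumes that the <Iteration> and
</Iteration> tags each sit on their own line, as in typical NCBI BLAST
XML output. Only one section is held in memory at a time, and each
section could then be handed to a DOM parser.

```ruby
require 'stringio'

# Hypothetical helper (not part of bio-blastxmlparser): yields every
# <Iteration>...</Iteration> section of a BLAST XML stream as a String.
# Assumes the opening and closing tags are on their own lines.
def each_iteration_chunk(io)
  buffer = nil
  io.each_line do |line|
    # Start a new section at the opening tag
    buffer = +'' if line.include?('<Iteration>')
    buffer << line if buffer
    # Hand the section over at the closing tag and drop it from memory
    if buffer && line.include?('</Iteration>')
      yield buffer
      buffer = nil
    end
  end
end

# Made-up, heavily trimmed input for illustration only
SAMPLE = <<~XML
  <BlastOutput>
    <BlastOutput_iterations>
      <Iteration>
        <Iteration_iter-num>1</Iteration_iter-num>
      </Iteration>
      <Iteration>
        <Iteration_iter-num>2</Iteration_iter-num>
      </Iteration>
    </BlastOutput_iterations>
  </BlastOutput>
XML

chunks = []
each_iteration_chunk(StringIO.new(SAMPLE)) { |chunk| chunks << chunk }
puts chunks.size
```

For a real file, File.open(path) can replace the StringIO, so the file
is read line by line rather than loaded in one go.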

I ran quite a few tests to see what type of parsing would give the best
results. Currently I parse the DOM, walk the low level nodes, and use
(lazy) XPath for the values.  There is probably still room for
improvement.
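
The lazy XPath idea can be sketched as follows. This is an
illustration only, not the gem's implementation: LazyHit is a
hypothetical class, and the sketch uses REXML from the standard Ruby
distribution purely to stay dependency-free, whereas the gem itself
builds on Nokogiri. The point is that a value is looked up with XPath
only when it is first asked for, and cached afterwards.

```ruby
require 'rexml/document'

# Hypothetical wrapper around a <Hit> node (not the gem's actual class).
# Nothing is extracted at construction time; each XPath lookup runs on
# first access and the result is memoized.
class LazyHit
  def initialize(node)
    @node = node
  end

  def hit_id
    @hit_id ||= text_at('Hit_id')
  end

  def length
    @length ||= text_at('Hit_len').to_i
  end

  private

  # Evaluate a relative XPath against the wrapped node
  def text_at(path)
    REXML::XPath.first(@node, path).text
  end
end

# Made-up, heavily trimmed input for illustration only
xml = <<~XML
  <Iteration_hits>
    <Hit><Hit_id>gi|1</Hit_id><Hit_len>120</Hit_len></Hit>
    <Hit><Hit_id>gi|2</Hit_id><Hit_len>95</Hit_len></Hit>
  </Iteration_hits>
XML

doc = REXML::Document.new(xml)
# Walk the child nodes directly, wrap them, and defer value extraction
hits = doc.root.get_elements('Hit').map { |node| LazyHit.new(node) }
puts hits.map(&:hit_id).join(',')
```

When only a few fields are needed for a tabular report, the fields
that are never requested cost nothing.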

One thing I will still try, when I have time, is parallelized parsing
on JRuby. With that it should be one of the fastest BLAST parsers on
the planet. 

Enjoy,

Pj.

On Fri, Dec 24, 2010 at 12:08:04PM +0100, Raoul Bonnal wrote:
> BioRuby plugin system was firstly announced at [BOSC 2010] and will be implemented by the Christmas 2010. Hopefully. :) -- Yes, we made it! Check out the BiogemInstallation and BiogemDevelopment sections.


