[BioRuby] Parsing large Blast xml files - a new bioruby plugin

Rob Syme rob.syme at gmail.com
Wed Jun 1 12:26:25 UTC 2011


I pushed a 1.4GB file through each of the parsers, simply counting the
number of hits per iteration:

             user     system      total        real
Rob:    91.510000   0.620000  92.130000 ( 92.527617)
Pjotr:  46.730000   0.430000  47.160000 ( 47.263949)

One of the important differences in the parsers is that mine is lazy 'all
the way down', in that the iterations are lazy, the hits are lazy and the
hsps are lazy. No large chunks of XML are ever buffered into a string and
then parsed together. While lazy-loading is a good idea, and should probably
be adopted in more of the BioRuby core, taking it to this extreme is a bit
silly.
Pjotr's (more sensible) approach is to chunk up the file by iterations, and
then use XPath to pull out the relevant information from there. One
iteration will never be more than a few kb - certainly no strain on memory
consumption. The IO strain of reading a file in tiny pieces looks to be the
cause of the 2x slowdown in the example above.
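
To make the chunk-by-iteration idea concrete, here is a minimal sketch. It is
not Pjotr's actual code: the helper names each_iteration_chunk and
hits_per_iteration are made up for illustration, REXML stands in for whatever
XML library the plugin really uses, and it assumes each <Iteration> and
</Iteration> tag sits on its own line, as in typical NCBI Blast XML output.

```ruby
require 'rexml/document'
require 'stringio'

# Hypothetical helper (not part of BioRuby or either plugin): yields each
# <Iteration>...</Iteration> block as a string, so that only one iteration
# is held in memory at a time.
# Assumption: the opening and closing tags each appear on their own line.
def each_iteration_chunk(io)
  return enum_for(:each_iteration_chunk, io) unless block_given?

  buffer = nil
  io.each_line do |line|
    buffer = +'' if line.include?('<Iteration>')   # start a new chunk
    buffer << line if buffer                       # collect lines inside it
    if buffer && line.include?('</Iteration>')     # chunk complete
      yield buffer
      buffer = nil
    end
  end
end

# Parse each small chunk on its own and use XPath to count its <Hit> elements.
def hits_per_iteration(io)
  each_iteration_chunk(io).map do |chunk|
    doc = REXML::Document.new(chunk)
    REXML::XPath.match(doc, '//Hit').size
  end
end

# Tiny made-up sample, trimmed to the elements the sketch relies on.
SAMPLE = <<~XML
  <?xml version="1.0"?>
  <BlastOutput>
    <BlastOutput_iterations>
      <Iteration>
        <Iteration_iter-num>1</Iteration_iter-num>
        <Iteration_hits>
          <Hit><Hit_num>1</Hit_num></Hit>
          <Hit><Hit_num>2</Hit_num></Hit>
        </Iteration_hits>
      </Iteration>
      <Iteration>
        <Iteration_iter-num>2</Iteration_iter-num>
        <Iteration_hits>
          <Hit><Hit_num>1</Hit_num></Hit>
        </Iteration_hits>
      </Iteration>
    </BlastOutput_iterations>
  </BlastOutput>
XML

p hits_per_iteration(StringIO.new(SAMPLE))  # prints [2, 1]
```

For a real results file, pass File.open(path) instead of the StringIO. Each
chunk is parsed and discarded before the next one is read, which is what keeps
memory use flat regardless of file size.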

Lesson 1: Pragmatism is a good thing.
Lesson 2: Always check to make sure work you're doing hasn't been done
before.
Lesson 3: Use Pjotr's parser to make light work of your large Blast results.

-r

On Wed, Jun 1, 2011 at 4:49 PM, Pjotr Prins <pjotr.public14 at thebird.nl> wrote:

> The general idea is to have a number of 'blessed' plugins tied to
> BioRuby releases. A blessed plugin is supposed to be rather solid,
> and have a level of documentation and testing.
>
> In addition there are 'development' plugins. Both should be listed on
> the plugin page. We are introducing that plumbing shortly. The
> duplication of work merely points out we need to get this done ;)
>
> It is interesting to note both XML parsers use lazy iterators. I also
> do lazy conversions. Same for my GFF3 plugin. Rob, it would be good to
> compare performance on some real-life data.
>
> Pj.
>
> On Wed, Jun 01, 2011 at 04:33:36PM +0800, Rob Syme wrote:
> > I think that the list at
> > http://bioruby.open-bio.org/wiki/BioRuby_Plugins is pretty
> > comprehensive, my mistake was simply not looking.
> > -r
> >
> >
> > On Wed, Jun 1, 2011 at 4:25 PM, Philipp Comans
> > <philipp.comans at googlemail.com> wrote:
> > > Hi,
> > >
> > > I had a similar problem recently. I needed an efficient parser for
> > > Blast XML results and I discovered that the default parser in BioRuby was
> > > not suitable. So I wrote my own using Nokogiri.
> > > In my opinion it is way too hard at the moment to discover BioPlugins.
> > > When people use the default XML or GFF parser that comes with BioRuby, they
> > > do not expect that there is another, more efficient version. There should be
> > > a section on the front page or even in the corresponding parts of the API
> > > documentation that makes people aware of the existence of these efficient
> > > parsers.
> > >
> > > BTW thank you all for BioRuby, I used it in a project recently and it made
> > > my life tremendously easier.
> > >
> > > Cheers,
> > >
> > > Philipp
> > >
> >
>


