[BioRuby] Performance of Bio::Blast.reports

Chris Fields cjfields at illinois.edu
Mon Nov 22 04:41:21 UTC 2010


One could, in fact, use the libxml2 pull parser to grab what you need; it doesn't require pulling the entire XML document into memory.
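
A rough sketch of the idea follows. It is not BioRuby code, and it uses REXML's pull parser from the Ruby standard library as a stand-in for libxml2 so that it runs without extra gems; the approach with a libxml2-backed reader would be analogous. The method name each_blast_hit is made up for illustration, and the element names (Iteration_query-def, Hit_id) assume the usual NCBI BLAST XML layout.

```ruby
require 'rexml/parsers/pullparser'
require 'stringio'

# Hypothetical helper: yields [query_def, hit_id] for every Hit_id element
# in a BLAST XML stream. Events are consumed one at a time, so no document
# tree is built in memory.
def each_blast_hit(io)
  parser = REXML::Parsers::PullParser.new(io)
  current = nil     # name of the element we are directly inside, if any
  query_def = nil   # query definition of the iteration being read
  while parser.has_next?
    event = parser.pull
    if event.start_element?
      current = event[0]
    elsif event.text?
      case current
      when 'Iteration_query-def' then query_def = event[0].strip
      when 'Hit_id' then yield query_def, event[0].strip
      end
    elsif event.end_element?
      current = nil
    end
  end
end

# Tiny stand-in for a real report; with a real file one would pass
# File.open(path) instead of a StringIO.
sample = <<~XML
  <BlastOutput>
    <BlastOutput_iterations>
      <Iteration>
        <Iteration_query-def>query1</Iteration_query-def>
        <Iteration_hits>
          <Hit><Hit_id>hitA</Hit_id></Hit>
          <Hit><Hit_id>hitB</Hit_id></Hit>
        </Iteration_hits>
      </Iteration>
    </BlastOutput_iterations>
  </BlastOutput>
XML

each_blast_hit(StringIO.new(sample)) { |query, hit| puts "#{query}\t#{hit}" }
```

Because the caller only ever sees one query/hit pair at a time, memory use should stay roughly flat regardless of report size, at the cost of having to track parser state by hand.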

chris

On Nov 21, 2010, at 5:36 AM, Pjotr Prins wrote:

> Unfortunately BioRuby still loads it in RAM. Someone should do a
> libxml2 version.
> 
> Easiest is to split the XML beforehand; that is what I do.
> 
> Pj.
> 
> On Sun, Nov 21, 2010 at 12:16:56PM +0100, Philipp Comans wrote:
>> Hi everyone,
>> 
>> I would like to parse large Blast reports with BioRuby. Each report will be around 150 - 400 MB. 
>> Is the XML parser for Blast reports able to parse files that big? I tested it with smaller reports of about 5 MB size and found that the performance was very poor when using MRI 1.8.7 but quite good when using JRuby.
>> I don't think that the system I am working on has enough RAM to keep the whole report in memory as the 5 MB input file already requires about 600 MB of RAM. Does the Blast report parser in BioRuby parse the whole file at once or does it use a streaming approach? The latter would be very advantageous in my case as far as I understand.
>> 
>> Right now, I am parsing the reports in XML format using the following command:
>> 
>> blast_reports = Bio::Blast.reports(file, :rexml)
>> 
>> Is there any performance advantage when using REXML instead of the default XML parser?
>> 
>> In your opinion, is it possible to parse such a large report in XML format?
>> An alternative for me would be to create Blast reports in tabular format, because I am sure that these can be read line by line. As far as I know, however, the tabular output does not contain all the information in the XML output, so I would have to take additional steps to recover that information.
>> 
>> Thanks for your help!
>> 
>> Best regards,
>> 
>> Philipp
>> _______________________________________________
>> BioRuby Project - http://www.bioruby.org/
>> BioRuby mailing list
>> BioRuby at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioruby




