[BioRuby] Performance of Bio::Blast.reports

Pjotr Prins pjotr.public14 at thebird.nl
Sun Nov 21 11:36:53 UTC 2010


Unfortunately, BioRuby still loads the whole report into RAM. Someone
should write a libxml2-based version.

The easiest approach is to split the XML beforehand; that is what I do.
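
A minimal sketch of such a split, assuming NCBI-style BLAST XML in which
every <Iteration> and </Iteration> tag sits on its own line. The helper
name each_blast_iteration is hypothetical and not part of BioRuby; it
reads the input line by line and yields one small, complete report per
iteration:

```ruby
# Hypothetical helper (not part of BioRuby): splits a BLAST XML stream into
# one self-contained XML document per <Iteration>, without loading the
# whole file. Assumes <Iteration> and </Iteration> each sit on their own line.
def each_blast_iteration(io)
  header    = String.new # lines before the first <Iteration>
  chunk     = nil        # text of the iteration currently being read
  in_header = true
  footer    = "</BlastOutput_iterations>\n</BlastOutput>\n"

  io.each_line do |line|
    if chunk
      chunk << line
      if line.include?("</Iteration>")
        # Re-wrap the iteration so it is a valid report on its own.
        yield header + chunk + footer
        chunk = nil
      end
    elsif line.include?("<Iteration>")
      in_header = false
      chunk = line.dup
    elsif in_header
      header << line
    end
    # Anything else (the closing tags of the original file) is skipped;
    # the footer above replaces it.
  end
end

# Usage (the file name is a placeholder):
#   File.open("report.xml") do |f|
#     each_blast_iteration(f) { |xml| puts xml.bytesize }
#   end
```

Each yielded string can then be written to its own file or handed to the
parser one at a time, so only a single iteration should need to be in
memory at once.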

Pj.

On Sun, Nov 21, 2010 at 12:16:56PM +0100, Philipp Comans wrote:
> Hi everyone,
> 
> I would like to parse large Blast reports with BioRuby. Each report will be around 150-400 MB.
> Is the XML parser for Blast reports able to parse files that big? I tested it with smaller reports of about 5 MB and found that the performance was very poor when using MRI 1.8.7 but quite good when using JRuby.
> I don't think that the system I am working on has enough RAM to keep the whole report in memory as the 5 MB input file already requires about 600 MB of RAM. Does the Blast report parser in BioRuby parse the whole file at once or does it use a streaming approach? The latter would be very advantageous in my case as far as I understand.
> 
> Right now, I am parsing the reports in XML format using the following command:
> 
> blast_reports = Bio::Blast.reports(file, :rexml)
> 
> Is there any performance advantage when using REXML instead of the default XML parser?
> 
> In your opinion, is it possible to parse such a large report in XML format?
> An alternative for me would be to create Blast reports in tabular format because I am sure that these can be read line by line. As far as I know, however, the tabular output does not contain all the information in the XML output, so I would have to take additional steps to recover that information.
> 
> Thanks for your help!
> 
> Best regards,
> 
> Philipp
> _______________________________________________
> BioRuby Project - http://www.bioruby.org/
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby


