[BioRuby] Performance of Bio::Blast.reports

Philipp Comans philipp.comans at googlemail.com
Sun Nov 21 11:16:56 UTC 2010


Hi everyone,

I would like to parse large Blast reports with BioRuby. Each report will be between 150 and 400 MB.
Is the XML parser for Blast reports able to parse files that big? I tested it with smaller reports of about 5 MB and found that performance was very poor with MRI 1.8.7 but quite good with JRuby.
I don't think the system I am working on has enough RAM to hold a whole report in memory, since the 5 MB input file already requires about 600 MB of RAM. Does the Blast report parser in BioRuby parse the whole file at once, or does it use a streaming approach? As far as I understand, the latter would be very advantageous in my case.
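
To illustrate what I mean by a streaming approach, here is a minimal sketch using the event-based parser from Ruby's REXML library. This is not BioRuby's API: the HitListener class and the each_hsp helper are hypothetical names of my own, and the sketch assumes the usual Blast XML element names (Iteration_query-def, Hit_id, Hsp_evalue). It reacts to elements as they are read and never builds a document tree, so memory use should depend far less on the size of the report.

```ruby
require "rexml/parsers/streamparser"
require "rexml/streamlistener"

# Hypothetical listener (not part of BioRuby): reports one
# (query, hit id, e-value) triple per HSP as the XML is read.
class HitListener
  include REXML::StreamListener

  attr_reader :count

  def initialize(&on_hsp)
    @on_hsp = on_hsp
    @text = String.new
    @query = nil
    @hit_id = nil
    @count = 0
  end

  # Start collecting character data for the element that just opened.
  def tag_start(_name, _attrs)
    @text = String.new
  end

  def text(data)
    @text << data
  end

  # Only the three elements of interest are kept; everything else is dropped.
  def tag_end(name)
    case name
    when "Iteration_query-def" then @query = @text.strip
    when "Hit_id"              then @hit_id = @text.strip
    when "Hsp_evalue"
      @count += 1
      @on_hsp.call(@query, @hit_id, @text.strip.to_f)
    end
    @text = String.new
  end
end

# Streams over an IO (or String) and yields each HSP; returns the HSP count.
def each_hsp(io, &block)
  listener = HitListener.new(&block)
  REXML::Parsers::StreamParser.new(io, listener).parse
  listener.count
end
```

With a file on disk this could be called as File.open(path) { |io| each_hsp(io) { |query, hit, evalue| puts [query, hit, evalue].join("\t") } }, where path is a placeholder for the report file.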

Right now, I am parsing the reports in XML format using the following command:

blast_reports = Bio::Blast.reports(file, :rexml)

Is there any performance advantage when using REXML instead of the default XML parser?

In your opinion, is it feasible to parse such a large report in XML format?
An alternative for me would be to create the Blast reports in tabular format, because I am sure these can be read line by line. As far as I know, however, the tabular output does not contain all the information in the XML output, so I would have to take additional steps to recover that information.
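
For the tabular route, line-by-line reading could look roughly like the sketch below. It assumes the standard 12-column tab-separated layout (blastall -m 8, or -outfmt 6 in BLAST+) and treats lines starting with "#" as comments, as in the commented variants of that format. TabularHit and each_tabular_hit are hypothetical names, not BioRuby classes.

```ruby
# Hypothetical record type for the 12 standard tabular columns.
TabularHit = Struct.new(
  :query_id, :subject_id, :percent_identity, :alignment_length,
  :mismatches, :gap_opens, :query_start, :query_end,
  :subject_start, :subject_end, :evalue, :bit_score
)

# Yields one TabularHit per data line; only the current line is held in memory.
def each_tabular_hit(io)
  return enum_for(:each_tabular_hit, io) unless block_given?

  io.each_line do |line|
    line = line.chomp
    next if line.empty? || line.start_with?("#") # blank or comment line

    f = line.split("\t")
    next if f.size < 12 # assumption: malformed lines are skipped silently

    yield TabularHit.new(
      f[0], f[1], f[2].to_f, f[3].to_i,
      f[4].to_i, f[5].to_i, f[6].to_i, f[7].to_i,
      f[8].to_i, f[9].to_i, f[10].to_f, f[11].to_f
    )
  end
end
```

A call such as File.open(path) { |io| each_tabular_hit(io) { |hit| puts hit.subject_id if hit.evalue < 1e-10 } } would then filter hits without loading the whole file; path and the 1e-10 cutoff are placeholders.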

Thanks for your help!

Best regards,

Philipp


