[BioRuby] Performance of Bio::Blast.reports
Philipp Comans
philipp.comans at googlemail.com
Sun Nov 21 11:16:56 UTC 2010
Hi everyone,
I would like to parse large Blast reports with BioRuby. Each report will be around 150-400 MB.
Is the XML parser for Blast reports able to parse files that big? I tested it with smaller reports of about 5 MB and found that performance was very poor with MRI 1.8.7 but quite good with JRuby.
I don't think the system I am working on has enough RAM to keep a whole report in memory, as a 5 MB input file already requires about 600 MB of RAM. Does the Blast report parser in BioRuby parse the whole file at once, or does it use a streaming approach? As far as I understand, the latter would be very advantageous in my case.
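To illustrate what I mean by a streaming approach, here is a minimal sketch that uses REXML's stream listener directly rather than anything from BioRuby. The class HitDefListener and the method each_hit_def are names I made up for this example, and picking out only the Hit_def elements is just an assumption about what one might want from the report. As far as I understand, the stream parser does not build a document tree, so memory use should not grow with the size of the file.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical listener (not part of BioRuby): hands the text of every
# <Hit_def> element to a block as soon as the element is closed.
class HitDefListener
  include REXML::StreamListener

  def initialize(&on_hit_def)
    @on_hit_def = on_hit_def
    @inside = false
    @buffer = String.new
  end

  # Called for every opening tag; start collecting text inside <Hit_def>.
  def tag_start(name, _attrs)
    return unless name == 'Hit_def'
    @inside = true
    @buffer = String.new
  end

  # Called for text nodes; may fire more than once per element.
  def text(data)
    @buffer << data if @inside
  end

  # Called for every closing tag; emit the collected text.
  def tag_end(name)
    return unless name == 'Hit_def'
    @inside = false
    @on_hit_def.call(@buffer)
  end
end

# Hypothetical helper: stream an IO (or String) through the listener.
def each_hit_def(io, &block)
  REXML::Document.parse_stream(io, HitDefListener.new(&block))
end
```

With a real report this could be called as File.open('report.xml') { |f| each_hit_def(f) { |d| puts d } }, where 'report.xml' is a placeholder file name.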
Right now, I am parsing the reports in XML format using the following command:
blast_reports = Bio::Blast.reports(file, :rexml)
Is there any performance advantage when using REXML instead of the default XML parser?
In your opinion, is it possible to parse such a large report in XML format?
An alternative for me would be to create Blast reports in tabular format, because I am sure that these can be read line by line. As far as I know, however, the tabular output does not contain all the information in the XML output, so I would have to take additional steps to recover that information.
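For the tabular route, I imagine something along the lines of the following sketch. It assumes the standard 12-column, tab-separated layout and skips comment lines starting with '#'. The constant TAB_FIELDS and the method each_tabular_hit are names invented for this example and are not part of BioRuby.

```ruby
# Assumed column order of the standard 12-column tabular Blast output.
TAB_FIELDS = %i[
  query_id subject_id identity alignment_length mismatches gap_opens
  q_start q_end s_start s_end evalue bit_score
].freeze

# Hypothetical helper: yields one Hash per hit, reading the IO line by
# line so that only the current line is held in memory.
def each_tabular_hit(io)
  return enum_for(:each_tabular_hit, io) unless block_given?

  io.each_line do |line|
    line = line.chomp
    # Skip blank lines and comment lines.
    next if line.empty? || line.start_with?('#')

    values = line.split("\t")
    # Ignore lines that do not match the assumed 12-column layout.
    next unless values.size == TAB_FIELDS.size

    hit = TAB_FIELDS.zip(values).to_h
    # Convert the numeric columns from strings.
    %i[identity evalue bit_score].each { |k| hit[k] = hit[k].to_f }
    %i[alignment_length mismatches gap_opens
       q_start q_end s_start s_end].each { |k| hit[k] = hit[k].to_i }
    yield hit
  end
end
```

Usage would be something like File.open('report.tab') { |f| each_tabular_hit(f) { |hit| puts hit[:subject_id] } }, with 'report.tab' standing in for the real file name.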
Thanks for your help!
Best regards,
Philipp