[BioRuby] Benchmarking FASTA file parsing
Martin Asser Hansen
mail at maasha.dk
Fri Aug 13 12:25:46 UTC 2010

Hello,

I am new to Ruby and have been testing BioRuby (1.4.0) for parsing FASTA files. A
rough comparison with Perl indicated that the BioRuby parser was slow, so I
hacked together a parser of my own in Ruby to benchmark the BioRuby parser
against. The result is disappointing: my hack is roughly 3x faster.
Admittedly, my hack should probably do some format consistency checking,
but I expect that to cost only a few percent in speed.
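
A minimal sketch of what such consistency checks could look like, using the same `gets("\n>")` splitting approach as the script further down. The class name `CheckedFasta`, the `FastaFormatError` exception, and the particular checks (input starts with `>`, non-empty name, non-empty sequence) are assumptions for illustration; they are not part of the original script or of BioRuby.

```ruby
require 'stringio'

# Hypothetical error class for malformed input (not part of BioRuby).
class FastaFormatError < StandardError; end

# Hypothetical variant of the hand-written parser with basic format checks.
class CheckedFasta
  include Enumerable

  def initialize(io)
    @io    = io
    @first = true
  end

  def each
    while (entry = get_entry)
      yield entry
    end
  end

  def get_entry
    block = @io.gets("\n>")
    return nil if block.nil?

    block.chomp!("\n>")

    if @first
      # Only the first block still carries its leading '>'.
      @first = false
      block.lstrip!
      raise FastaFormatError, "input does not start with '>'" unless block.start_with?(">")
      block = block[1..-1]
    end

    seq_name, seq = block.split("\n", 2)
    seq = seq.to_s.gsub(/\s/, "")

    raise FastaFormatError, "missing sequence name" if seq_name.nil? || seq_name.strip.empty?
    raise FastaFormatError, "no sequence for #{seq_name}" if seq.empty?

    { :seq_name => seq_name, :seq => seq }
  end
end
```

The checks add one branch and two comparisons per record, which is why their cost is expected to be small next to the string splitting itself; that expectation is not measured here.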

Could someone explain why the BioRuby parser is so slow?
Is it possible to optimize the code without a major rewrite?

Here is the benchmark result:
           user     system      total        real
Hack   5.440000   0.010000   5.450000 (  5.494207)
Bio   18.410000   0.020000  18.430000 ( 18.579867)
The code is shown below.
Cheers,
Martin
#!/usr/bin/env ruby
require 'stringio'
require 'bio'
require 'benchmark'
class Fasta
  include Enumerable
  def initialize(io)
    @io = io
  end
  def each
    while entry = get_entry do
      yield entry
    end
  end
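  # Read one record: everything up to the next "\n>" separator.
  def get_entry
    block = @io.gets("\n>")
    return nil if block.nil?
    block.chomp!("\n>")                   # drop the trailing record separator
    block.sub!(/^\s|^>/, "")              # drop a leading ">" (or leading whitespace character)
    seq_name, seq = block.split("\n", 2)  # header line, then the sequence lines
    seq.gsub!(/\s/, "")                   # join wrapped sequence lines
    { :seq_name => seq_name, :seq => seq }
  end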
end
data  = <<DATA
>5_gECOjxwXsN1/1
AACGNTACTATCGTGACATGCGTGCAGGATTACAC
>3_8ICOjxwXsN1/1
ACTCNAGGGTTCGATTCCCTTCAACCGCCCCATAA
>3_GUCOjxwXsN1/1
TTGCNTCCTTCTTCTGCCTTCGTTGGCTCAGATTG
>5_BWCOjxwXsN1/1
TATATACAGGAATCCATTGTTGTTTAGATTCAGTT
>7_NZCOjxwXsN1/1
AGGTGATCCAGCCGCACCTTCCGATACGGCTACCT
>3_2VCOjxwXsN1/1
CTTTTCCAGGTGTGTAGACATCTTCACCCATTAAG
>5_kVCOjxwXsN1/1
CTACACCTAAGTTACATCGTCCATTATTTTCCAAT
>1_GbCOjxwXsN1/1
CCAGACAACTAGGATGTTGGCTTAGAAGCAGCCAT
>5_fTCOjxwXsN1/1
TTAGCTTTAACCATTTTCTTTTTGTCTAAAGCAAA
>3_VWCOjxwXsN1/1
TTATGATGCGCGTGGCGAACGTGAACGCGTTAAAC
DATA
io1    = StringIO.new(data)
io2    = StringIO.new(data)
fasta1 = Fasta.new(io1)
fasta2 = Bio::FastaFormat.open(io2)
Benchmark.bm(5) do |timer|
  timer.report('Hack') { 10_000_000.times { fasta1.each { |entry1| } } }
  timer.report('Bio')  { 10_000_000.times { fasta2.each { |entry2| } } }
end
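
One point worth checking in the script above: a StringIO that has been read to the end returns nil from `gets` until it is rewound, so with the hand-written parser only the first of the repeated `each` calls sees any entries. The following is a minimal sketch, assuming the intent is to parse every entry on every pass. The helper `each_fasta_entry`, the two-record sample data, and the `passes` count are hypothetical stand-ins, and the BioRuby reader is not exercised here, so how it behaves on an exhausted StringIO is not shown.

```ruby
require 'stringio'
require 'benchmark'

# Minimal stand-in for the hand-written parser: yields name and sequence.
def each_fasta_entry(io)
  while (block = io.gets("\n>"))
    block.chomp!("\n>")
    block.sub!(/\A>/, "")
    name, seq = block.split("\n", 2)
    yield name, seq.to_s.gsub(/\s/, "")
  end
end

# Hypothetical two-record sample; the sequences are taken from the data above.
data = ">seq1\nAACGNTACTATCGTGACATGCGTGCAGGATTACAC\n" \
       ">seq2\nACTCNAGGGTTCGATTCCCTTCAACCGCCCCATAA\n"
io = StringIO.new(data)

parsed = 0
passes = 10_000 # assumed pass count, kept small for a quick run

Benchmark.bm(5) do |timer|
  timer.report('Hack') do
    passes.times do
      io.rewind # without this, every pass after the first reads nothing
      each_fasta_entry(io) { |_name, _seq| parsed += 1 }
    end
  end
end

puts "entries parsed: #{parsed}"
```

Counting the parsed entries alongside the timing gives a quick sanity check that each pass did real work.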
    
    