[BioRuby] Benchmarking FASTA file parsing

Naohisa GOTO ngoto at gen-info.osaka-u.ac.jp
Fri Aug 13 14:47:35 UTC 2010


Hi,

On Fri, 13 Aug 2010 14:25:46 +0200
Martin Asser Hansen <mail at maasha.dk> wrote:

> io1    = StringIO.new(data)
> io2    = StringIO.new(data)
> fasta1 = Fasta.new(io1)
> fasta2 = Bio::FastaFormat.open(io2)
> 
> Benchmark.bm(5) do |timer|
>   timer.report('Hack') { 10_000_000.times { fasta1.each { |entry1| } } }
>   timer.report('Bio')  { 10_000_000.times { fasta2.each { |entry2| } } }
> end

The IO (StringIO or Bio::FlatFile object) needs to be rewound after
each pass during the benchmark. Otherwise every pass after the first
starts at end-of-file and reads nothing.

#(snip)
  Benchmark.bm(5) do |timer|
    timer.report('Hack') { 10_000_000.times { 
      fasta1.each { |entry1| }; io1.rewind } }
    timer.report('Bio')  { 10_000_000.times {
      fasta2.each { |entry2| }; fasta2.rewind } }
  end
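
As a minimal sketch of why the rewind matters, the following uses only
StringIO from the standard library (not the Fasta or Bio::FastaFormat
classes above). The two-record FASTA string and the variable names are
made up for illustration.

```ruby
require 'stringio'

# Two-record FASTA text, invented for this example.
data = ">seq1\nACGT\n>seq2\nGGCC\n"
io   = StringIO.new(data)

# Three passes without rewinding: only the first pass sees any lines.
without_rewind = Array.new(3) { io.each_line.count }

io.rewind

# Three passes, rewinding after each one: every pass sees all lines.
with_rewind = Array.new(3) do
  n = io.each_line.count
  io.rewind
  n
end

p without_rewind  # => [4, 0, 0]
p with_rewind     # => [4, 4, 4]
```

Without the rewind, a benchmark loop would time one real pass followed
by many passes over an already exhausted stream.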

The reason for using "fasta2.rewind" instead of "io2.rewind" is that
"fasta2" is an instance of Bio::FlatFile, the IO wrapper used in
BioRuby. To keep the information inside the wrapper consistent,
using fasta2.rewind rather than io2.rewind is recommended.
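
The general hazard can be sketched with a toy wrapper. PeekingReader
below is hypothetical and is NOT how Bio::FlatFile is implemented; it
only assumes a wrapper that keeps some read-ahead state, and shows how
rewinding the underlying IO directly can leave that state stale.

```ruby
require 'stringio'

# Hypothetical wrapper for illustration only (not Bio::FlatFile).
# It may hold one line that was read ahead from the underlying IO.
class PeekingReader
  def initialize(io)
    @io     = io
    @peeked = nil
  end

  # Look at the next line without consuming it.
  def peek
    @peeked ||= @io.gets
  end

  # Return the next line, serving the buffered line first.
  def gets
    line    = peek
    @peeked = nil
    line
  end

  # Rewind the underlying IO and discard the buffered line.
  def rewind
    @peeked = nil
    @io.rewind
  end
end

io     = StringIO.new(">seq1\nACGT\n")
reader = PeekingReader.new(io)

reader.gets          # consumes ">seq1\n"
reader.peek          # buffers "ACGT\n" inside the wrapper
io.rewind            # bypasses the wrapper; its buffer is now stale
stale = reader.gets  # still returns the buffered "ACGT\n"

reader.rewind        # rewinds the IO and clears the buffer
fresh = reader.gets  # returns ">seq1\n"

p stale
p fresh
```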

I applied the above changes and reduced the iteration count to
100,000, and got results showing the same tendency.

(ruby 1.8.7-p299 (debian Squeeze 1.8.7.299-1))
           user     system      total        real
Hack   7.240000   0.160000   7.400000 (  7.390807)
Bio   23.250000   0.850000  24.100000 ( 24.100267)

(ruby 1.9.1-p243 with env LANG=C)
           user     system      total        real
Hack   5.600000   0.010000   5.610000 (  5.605175)
Bio   15.920000   0.000000  15.920000 ( 15.917899)


With E. coli genome ORF data, the difference becomes smaller,
especially in Ruby 1.9.1.

(snip)
  # ftp://ftp.ncbi.nih.gov:/genbank/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655/U00096.ffn
  io1    = File.open('U00096.ffn')
  io2    = File.open('U00096.ffn')
  fasta1 = Fasta.new(io1)
  fasta2 = Bio::FastaFormat.open(io2)

  Benchmark.bm(5) do |timer|
    timer.report('Hack') { 100.times {
      fasta1.each { |entry1| }; io1.rewind } }
    timer.report('Bio')  { 100.times {
      fasta2.each { |entry2| }; fasta2.rewind } }
  end

(ruby 1.8.7-p299)
           user     system      total        real
Hack   8.340000   0.140000   8.480000 (  8.492107)
Bio   13.480000   0.520000  14.000000 ( 13.998213)

(ruby 1.9.1-p243 with env LANG=C)
           user     system      total        real
Hack   9.130000   0.140000   9.270000 (  9.270361)
Bio    9.380000   0.180000   9.560000 (  9.565899)
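
To put numbers on "smaller", the Bio/Hack ratio of the "real" column
can be computed for each of the four tables. This is only arithmetic
on the figures reported above; the hash labels are informal names
chosen for this sketch.

```ruby
# "real" seconds copied from the four result tables in this message.
# The labels are informal names, not part of the benchmark output.
real = {
  'StringIO, 1.8.7' => { hack: 7.390807, bio: 24.100267 },
  'StringIO, 1.9.1' => { hack: 5.605175, bio: 15.917899 },
  'E. coli, 1.8.7'  => { hack: 8.492107, bio: 13.998213 },
  'E. coli, 1.9.1'  => { hack: 9.270361, bio: 9.565899 }
}

# Ratio of Bio time to Hack time, rounded to two decimals.
ratios = real.map { |label, t| [label, (t[:bio] / t[:hack]).round(2)] }

ratios.each do |label, ratio|
  puts format('%-16s Bio/Hack = %.2f', label, ratio)
end
# Ratios in order: 3.26, 2.84, 1.65, 1.03
```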

-- 
Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org


