[BioRuby] Benchmarking FASTA file parsing

Tomoaki NISHIYAMA tomoakin at kenroku.kanazawa-u.ac.jp
Sat Aug 14 03:42:07 UTC 2010


Hi,

> Mind you that the Benchmark is performed on StringIO data, and that  
> the script does not touch the disk!
> In a real test, it will be much slower!

My initial thought was: that's true, and therefore the pure parser  
part, which runs fairly fast in O(N),
is not the primary problem.  If you push the entries into a hash it  
will be much more time consuming.

But then I realized that it is two orders of magnitude slower (due to  
the benchmark code, as pointed out by Goto-san).
20 min for 100 M entries could be painful.

> I have been trying to get an overview of the code in Bio::FastaFormat,
> but I find it hard to read (that could be because I am not used to  
> Ruby, or OO for that matter).


For one thing, Bio::FastaFormat is designed to work with  
Bio::FlatFile.
If you write a dedicated FASTA parser, it could run much faster.

# For a very simple operation on NGS data I would write C code;
# that will run about 100 times faster.
# When the necessary operation is a bit more complex, I would use
# Ruby, which is much more time consuming...

Perhaps the target is to process about 20 ~ 1000 M reads, each of
them 25 to 150 nt, for the time being.
That's quite a different situation from processing the
~ 0.1 M entries of 50-10000 aa residues or nucleotides in a genome:
the relative cost of entry separation becomes higher compared  
with the sequence
processing within each entry.

So, it may be worth writing an NGS-dedicated parser rather than  
sticking with FlatFile.
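
As a rough illustration of what such a dedicated parser could look like (the class name and interface below are hypothetical, not BioRuby API), it can simply split the stream on ">" record boundaries and skip all per-entry sub-parsing:

```ruby
require 'stringio'

# Sketch of an NGS-dedicated FASTA reader (hypothetical; not part of
# BioRuby).  It reads record by record using ">" as the input record
# terminator and does no comment or definition-line sub-parsing.
class NgsFastaReader
  def initialize(io)
    @io = io
  end

  # Yields [name, sequence] for each entry; newlines inside the
  # sequence are removed so wrapped sequences also work.
  def each
    while (block = @io.gets("\n>"))
      block = block.chomp("\n>")
      block = block.sub(/\A>/, "")   # only the first record keeps its leading ">"
      name, _, seq = block.partition("\n")
      yield name, seq.delete("\n")
    end
  end
end
```

With a real file the same loop applies; just pass an opened File instead of a StringIO.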

Playing around with the benchmark, about half of the execution time  
is spent in garbage collection,
and the order of execution somewhat affects the numbers.
If you suppress unnecessary object generation to the minimum and  
disable GC, that will
perhaps make it run much faster.
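
To illustrate the pattern (the workload below is made up for demonstration; only the GC.start / GC.disable bracketing matters), one can force a collection before each timed section and then suppress GC inside it:

```ruby
require 'benchmark'

# Hypothetical allocation-heavy workload: builds n short strings.
def build_strings(n)
  a = []
  n.times { |i| a << "read#{i}" + "ACGT" * 8 }
  a.length
end

Benchmark.bm(8) do |timer|
  GC.enable; GC.start                 # start from a freshly collected heap
  timer.report('GC on')  { build_strings(200_000) }

  GC.start; GC.disable                # collect once, then suppress GC while timing
  timer.report('GC off') { build_strings(200_000) }
  GC.enable                           # re-enable GC afterwards
end
```

Remember to re-enable GC after the timed region, or memory use will grow without bound.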

$ diff -u benchfasta benchfasta-hash-GC-b
--- benchfasta  2010-08-13 21:45:21.000000000 +0900
+++ benchfasta-hash-GC-b        2010-08-14 11:53:20.000000000 +0900
@@ -34,6 +34,9 @@
    end
  end

+count = ARGV.shift.to_i
+count = 2 if count == 0
+
  data  = <<DATA
  >5_gECOjxwXsN1/1
  AACGNTACTATCGTGACATGCGTGCAGGATTACAC
@@ -57,12 +60,23 @@
  TTATGATGCGCGTGGCGAACGTGAACGCGTTAAAC
  DATA

-io1    = StringIO.new(data)
-io2    = StringIO.new(data)
+io0    = StringIO.new(data * count)
+io1    = StringIO.new(data * count)
+io2    = StringIO.new(data * count)
+fasta0 = Fasta.new(io0)
  fasta1 = Fasta.new(io1)
  fasta2 = Bio::FastaFormat.open(io2)

-Benchmark.bm(5) do |timer|
-  timer.report('Hack') { 10_000_000.times { fasta1.each { |entry1| } } }
-  timer.report('Bio')  { 10_000_000.times { fasta2.each { |entry2| } } }
+hash0 = Hash.new
+hash1 = Hash.new
+hash2 = Hash.new
+
+Benchmark.bm(8) do |timer|
+  GC.enable; GC.start; GC.disable
+  timer.report('Bio')      { i = 0; fasta2.each { |entry2| i += 1; hash2[entry2.definition + i.to_s] = entry2.seq[2..25] } }
+  hash2 = nil; GC.enable; GC.start; GC.disable
+  timer.report('Hack')     { i = 0; fasta0.each { |entry1| i += 1; hash0[entry1[:seq_name] + i.to_s] = entry1[:seq][2..25] } }
+  hash0 = nil; GC.enable; GC.start; GC.disable
+  timer.report('Hack-seq') { i = 0; fasta1.each { |entry1| i += 1; hash1[entry1[:seq_name] + i.to_s] = Bio::Sequence::NA.new(entry1[:seq])[2..25] } }
+  hash1 = nil; GC.enable; GC.start; GC.disable
   end





-- 
Tomoaki NISHIYAMA

Advanced Science Research Center,
Kanazawa University,
13-1 Takara-machi,
Kanazawa, 920-0934, Japan


On 2010/08/13, at 23:51, Martin Asser Hansen wrote:

>
> As you stated 3 times faster with the hack, you may be already  
> using ruby 1.9.
>
>
> I am using ruby 1.9.1, and I am using a fairly fast computer, but I  
> am actually questioning the quality of the code.
>
> Anyway, I think 13 or 18 seconds for 100 M entries is fast enough,  
> and this
> part will not be the bottleneck of any application.
> How fast do you need it to be?
>
> Mind you that the Benchmark is performed on StringIO data, and that  
> the script does not touch the disk! In a real test, it will be much  
> slower! I did not test on real data and more speed issues may  
> surface (I have no idea how Ruby's file buffering compares to  
> Perl's, performance-wise).
>
> I was contemplating porting some Biopieces (www.biopieces.org) from  
> Perl to Ruby. Biopieces are used for everyday slicing and dicing of  
> all sorts of biological data in a very simple and flexible manner.  
> While Biopieces are not as fast as dedicated scripts, they are fast  
> enough for convenient analysis of NGS data, but I will not accept a  
> +300% speed penalty (i.e. read_fasta).
>
> I have been trying to get an overview of the code in  
> Bio::FastaFormat, but I find it hard to read (that could be because  
> I am not used to Ruby, or OO for that matter). It strikes me that  
> the FastaFormat class does a number of irrelevant things, like  
> subparsing comments when not strictly necessary. In fact, the FASTA  
> format doesn't use comments prefixed with # (a semicolon can  
> be used, but I would strongly advise against it, since most software  
> doesn't deal with it). Also, parsing is dependent on the record  
> separator being '\n' - that could be considered a bug. There seems  
> to be an overuse of substitutions, transliterations and regex  
> matching. How about keeping it nice and tight? E.g.:
>
> SEP         = $/
> FASTA_REGEX = /\s*>?([^#{SEP}]+)#{SEP}(.+?)>?\z/m
>
> def get_entry
>   block = @io.gets(SEP + ">")
>   return nil if block.nil?
>
>   if block =~ FASTA_REGEX
>     seq_name = $1
>     seq      = $2
>   else
>     raise "Bad FASTA entry->#{block}"
>   end
>
>   seq.gsub!(/\s/, "")
>   [seq_name, seq]
> end
>
>
> Cheers,
>
>
> Martin
>



