[BioRuby] Benchmarking FASTA file parsing
Tomoaki NISHIYAMA
tomoakin at kenroku.kanazawa-u.ac.jp
Sat Aug 14 14:52:57 UTC 2010
Hi,
> Subparsing of the defline should not be done for generic parsing,
> but rather when needed.
To my understanding, in the current code the subparsing of the
definition occurs only when needed, i.e. when entry_id, identifiers,
gi, etc. are called.
If only definition is called, it is not parsed further.
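As a minimal sketch of this on-demand behaviour (the class name
LazyDefline, the parse_count counter and the regexes are hypothetical
and only illustrate the idea; this is not the actual BioRuby code):

```ruby
# Hypothetical illustration class, not the actual Bio::FastaDefline code.
class LazyDefline
  attr_reader :definition
  # Number of times the definition line has actually been subparsed.
  attr_reader :parse_count

  def initialize(definition)
    @definition = definition   # stored as-is; nothing is parsed here
    @parse_count = 0
  end

  # First whitespace-delimited token of the definition line.
  def entry_id
    parsed[:entry_id]
  end

  # GI number if the line starts with "gi|<digits>", otherwise nil.
  def gi
    parsed[:gi]
  end

  private

  # Subparse on first use only and cache the result.
  def parsed
    @parsed ||= begin
      @parse_count += 1
      token = @definition[/\S+/]
      { entry_id: token, gi: token && token[/\Agi\|(\d+)/, 1] }
    end
  end
end

d = LazyDefline.new("gi|12345|ref|NM_000001.1| example sequence")
d.definition   # no subparsing yet
d.entry_id     # triggers the one-time subparse
d.gi           # reuses the cached result
```

Calling only definition never runs the regexes, so a generic pass over
the entries pays nothing for the identifier parsing.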
> Without any experience I think that disabling GC sounds like a bad
> idea.
Yes, completely disabling GC is generally a bad idea.
Code that runs with 6 Gbytes of memory could eat 60 Gbytes or more...
(Yes, it seems two- or three-fold faster if there is enough memory,
but this trade-off is too extreme.)
But since the GC dominates the running time,
it is an important target for optimization.
http://en.wikibooks.org/wiki/Ruby_Programming/Reference/Objects/GC
A more moderate reduction of GC frequency will surely speed up the
process by 30-50%.
Admittedly, explicit GC.disable and GC.start calls make the code ugly.
A trial of tweaking the parameters in gc.c gave only a minor (~5%)
improvement.
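A moderate reduction of GC frequency could look roughly like the sketch
below. The helper name each_with_periodic_gc and its interval parameter
are assumptions made for illustration, not anything in BioRuby, and the
sketch does not by itself verify the speedup figures above.

```ruby
require 'stringio'

# Iterate over `enum` with automatic GC switched off, forcing one
# collection every `interval` items so that memory use stays bounded.
# `interval` is a hypothetical tuning knob.
def each_with_periodic_gc(enum, interval = 100_000)
  was_disabled = GC.disable   # true if GC was already disabled
  count = 0
  enum.each do |item|
    yield item
    count += 1
    if (count % interval).zero?
      GC.enable
      GC.start                # explicit collection at a chosen point
      GC.disable
    end
  end
  count
ensure
  GC.enable unless was_disabled   # restore the previous GC state
end

io = StringIO.new(">a\nACGT\n>b\nTTGA\n" * 1000)
n = each_with_periodic_gc(io.each_line, 500) { |line| line.chomp }
puts "processed #{n} lines"
```

Wrapping the calls in one helper keeps GC.disable and GC.start out of
the parsing code itself, which addresses the ugliness at least in part.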
Careful coding to reduce object creation might contribute to a speedup.
One questionable variable is
@entry_overrun
Are this variable and attr_reader :entry_overrun
still really required, or are they just a trace of older code? > Goto-San
Since there are only two other variables, which are apparently essential,
this third variable might account for a significant speed reduction.
A test again suggested that removing 3 lines can improve speed by about 5%.
(Unfortunately not 50%)
diff --git a/lib/bio/db/fasta.rb b/lib/bio/db/fasta.rb
index 7ea668e..95f3be4 100644
--- a/lib/bio/db/fasta.rb
+++ b/lib/bio/db/fasta.rb
@@ -111,7 +111,7 @@ module Bio
# The seuqnce lines in text.
attr_accessor :data
- attr_reader :entry_overrun
+# attr_reader :entry_overrun
# Stores the comment and sequence information from one entry of the
# FASTA format string. If the argument contains more than one
@@ -119,8 +119,8 @@ module Bio
def initialize(str)
@definition = str[/.*/].sub(/^>/, '').strip # 1st line
@data = str.sub(/.*/, '') # rests
- @data.sub!(/^>.*/m, '') # remove trailing entries for sure
- @entry_overrun = $&
+# @data.sub!(/^>.*/m, '') # remove trailing entries for sure
+# @entry_overrun = $&
end
# Returns the stored one entry as a FASTA format. (same as to_s)
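
The effect of this patch can be checked in isolation with a small
benchmark such as the sketch below. The two method names are
hypothetical stand-ins for the constructor body before and after the
patch, the test entry is the first one from the benchmark data, and the
measured difference will depend on the Ruby version and machine.

```ruby
require 'benchmark'

# Stand-in for the original initialize body (with @entry_overrun).
def parse_with_overrun(str)
  definition = str[/.*/].sub(/^>/, '').strip   # 1st line
  data = str.sub(/.*/, '')                     # rest
  data.sub!(/^>.*/m, '')                       # remove trailing entries
  overrun = $&                                 # nil when nothing was removed
  [definition, data, overrun]
end

# Stand-in for the patched initialize body (two lines removed).
def parse_without_overrun(str)
  definition = str[/.*/].sub(/^>/, '').strip   # 1st line
  data = str.sub(/.*/, '')                     # rest
  [definition, data]
end

entry = ">5_gECOjxwXsN1/1\nAACGNTACTATCGTGACATGCGTGCAGGATTACAC\n"
n = 100_000

Benchmark.bm(16) do |timer|
  timer.report('with overrun')    { n.times { parse_with_overrun(entry) } }
  timer.report('without overrun') { n.times { parse_without_overrun(entry) } }
end
```

Note that the two versions differ in behaviour when the string holds
more than one entry: without the sub! call, the trailing entries stay
in the data.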
--
Tomoaki NISHIYAMA
Advanced Science Research Center,
Kanazawa University,
13-1 Takara-machi,
Kanazawa, 920-0934, Japan
On 2010/08/14, at 17:21, Martin Asser Hansen wrote:
> I was hoping for an easy to use generic FASTA parser in bioruby. I
> think it would be confusing with two flavors of parsers for short/
> long entries. Also, I think that with a minor effort the existing
> parser could be optimized a fair bit. Subparsing of the defline
> should not be done for generic parsing, but rather when needed.
> Without any experience I think that disabling GC sounds like a bad
> idea. Of course C is always faster, but Ruby is nicer.
>
>
> Cheers,
>
>
> Martin
>
>
>
>
> On Sat, Aug 14, 2010 at 5:42 AM, Tomoaki NISHIYAMA
> <tomoakin at kenroku.kanazawa-u.ac.jp> wrote:
> Hi,
>
>> Mind you that the Benchmark is performed on StringIO data, and
>> that the script does not touch the disk!
>> In a real test, it will be much slower!
>
> My initial thought was :- That's true, and therefore the pure
> parser part which runs fairly fast with O(N)
> is not the primary problem. If you push the entries into a hash it
> will be much more time consuming.
>
> But I realized that it's two orders of magnitude slower... (due to the
> benchmark code, as pointed out by goto-san)
> 20 min for 100 M could be painful.
>
>> I have been trying to get an overview of the code in
>> Bio::FastaFormat,
>> but I find it hard to read (that could be because I am not used to
>> Ruby, or OO for that matter).
>
>
> For one thing, the Bio::FastaFormat is designed to work with
> Bio::FlatFile.
> If you write a dedicated fasta parser that could run much faster.
>
> # I would write C codes for a very simple operation on NGS data.
> # That will run 100 times faster.
> # When the necessary operation is a bit more complex, I would use
> ruby. much much more time consuming....
>
> Perhaps the target is to process about 20 ~ 1000 M reads, each of
> them having 25 to 150 nt, for the time being.
> That's quite a different situation compared to processing the
> ~ 0.1 M entries of 50-10000 aa residues or nucleotides in a genome.
> The relative cost for the entry separation becomes higher compared
> with the sequence
> processing within the entry.
>
> So, it may be worth writing an NGS-dedicated parser rather than
> sticking to FlatFile.
>
> Playing around with the benchmark, about half of the execution time is
> for garbage collection,
> and the order of execution is somewhat relevant to get the number.
> If you can suppress unnecessary object generation to the minimum
> and disable GC, that will
> perhaps make it run much faster.
>
> $ diff -u benchfasta benchfasta-hash-GC-b
> --- benchfasta 2010-08-13 21:45:21.000000000 +0900
> +++ benchfasta-hash-GC-b 2010-08-14 11:53:20.000000000 +0900
> @@ -34,6 +34,9 @@
> end
> end
>
> +count = ARGV.shift.to_i
> +count = 2 if count == 0  # nil.to_i is 0, so a missing argument gives 0, never nil
> +
> data = <<DATA
> >5_gECOjxwXsN1/1
> AACGNTACTATCGTGACATGCGTGCAGGATTACAC
> @@ -57,12 +60,23 @@
> TTATGATGCGCGTGGCGAACGTGAACGCGTTAAAC
> DATA
>
> -io1 = StringIO.new(data)
> -io2 = StringIO.new(data)
> +io0 = StringIO.new(data * count)
> +io1 = StringIO.new(data * count)
> +io2 = StringIO.new(data * count)
> +fasta0 = Fasta.new(io0)
> fasta1 = Fasta.new(io1)
> fasta2 = Bio::FastaFormat.open(io2)
>
> -Benchmark.bm(5) do |timer|
> - timer.report('Hack') { 10_000_000.times { fasta1.each { |entry1| } } }
> - timer.report('Bio') { 10_000_000.times { fasta2.each { |entry2| } } }
> +hash0=Hash.new
> +hash1=Hash.new
> +hash2=Hash.new
> +
> +Benchmark.bm(8) do |timer|
> + GC.enable;GC.start;GC.disable;
> + timer.report('Bio') { i=0; fasta2.each { |entry2| i+=1; hash2[entry2.definition + i.to_s] = entry2.seq[2..25]} }
> + hash2 = nil; GC.enable;GC.start;GC.disable;
> + timer.report('Hack') { i=0; fasta0.each { |entry1| i+=1; hash0[entry1[:seq_name] + i.to_s] = entry1[:seq][2..25]} }
> + hash0 = nil; GC.enable;GC.start;GC.disable;
> + timer.report('Hack-seq') { i=0; fasta1.each { |entry1| i+=1; hash1[entry1[:seq_name] + i.to_s] = Bio::Sequence::NA.new(entry1[:seq])[2..25]} }
> + hash1 = nil; GC.enable;GC.start;GC.disable;
> end
>
>
>
>
>
>
> --
> Tomoaki NISHIYAMA
>
> Advanced Science Research Center,
> Kanazawa University,
> 13-1 Takara-machi,
> Kanazawa, 920-0934, Japan
>
>
> On 2010/08/13, at 23:51, Martin Asser Hansen wrote:
>
>>
>> As you stated 3 times faster with the hack, you may be already
>> using ruby 1.9.
>>
>>
>> I am using ruby 1.9.1, and I am using a fairly fast computer, but
>> I am actually questioning the quality of the code.
>>
>> Anyway, I think 13 or 18 seconds for 100 M entry is fast enough
>> and this
>> part will not be the bottle neck of any application.
>> How fast do you need it be?
>>
>> Mind you that the Benchmark is performed on StringIO data, and
>> that the script does not touch the disk! In a real test, it will
>> be much slower! I did not test on real data and more speed issues
>> may surface (I have no idea how Ruby's file buffering compares to
>> Perl's, performance-wise).
>>
>> I was contemplating porting some Biopieces (www.biopieces.org)
>> from Perl to Ruby. Biopieces are used for everyday slicing and
>> dicing of all sorts of biological data in a very simple and
>> flexible manner. While Biopieces are not as fast as dedicated
>> scripts, they are fast enough for convenient analysis of NGS data,
>> but I will not accept a +300% speed penalty (i.e. read_fasta).
>>
>> I have been trying to get an overview of the code in
>> Bio::FastaFormat, but I find it hard to read (that could be
>> because I am not used to Ruby, or OO for that matter). It strikes
>> me that the FastaFormat class does a number of irrelevant things
>> like subparsing comments when not strictly necessary. In fact, the
>> FASTA format actually doesn't use comments prefixed with #
>> (semicolon can be used, but I will strongly advise against it
>> since most software doesn't deal with it). Also, parsing is
>> dependent on the record separator being '\n' - that could be
>> considered a bug. There seems to be an overuse of substitutions,
>> transliterations and regex matching. How about keeping it nice and
>> tight? ala:
>>
>> SEP = $/
>> FASTA_REGEX = /\s*>?([^#{SEP}]+)#{SEP}(.+)>?$/
>>
>> def get_entry
>> block = @io.gets(SEP + ">")
>> return nil if block.nil?
>>
>> if block =~ FASTA_REGEX
>> seq_name = $1
>> seq = $2
>> else
>> raise "Bad FASTA entry->#{block}"
>> end
>>
>> seq.gsub!(/\s/, "")
>> { :seq_name => seq_name, :seq => seq } # return the entry; gsub! alone may return nil
>> end
>>
>>
>> Cheers,
>>
>>
>> Martin
>>
>> --
>> Tomoaki NISHIYAMA
>>
>> Advanced Science Research Center,
>> Kanazawa University,
>> 13-1 Takara-machi,
>> Kanazawa, 920-0934, Japan
>>
>>
>
>