[BioRuby] Biased Bio::Sequence randomize()

Mon Oct 13 20:55:41 UTC 2008

Dear Jacobsen,

> I believe that the current sequence randomization/shuffle method is severely
> biased, infrequent bases are more likely to occur in the end of the sequence
> than in the beginning:

You are right. 

I had fixed a while ago, but it seems that I forgot to commit to the repository, sorry.

Could you try the following replacement?

  def randomize(hash = nil)
    if hash
      tmp = ''
      hash.each {|k, v|
        tmp += k * v.to_i
      }
    else
      tmp = self
    end
    len = tmp.length - 1
    seq = ''
    ary = (0..len).to_a.sort_by{rand}
    ary.each do |idx|
      if block_given?
        yield tmp[idx]
      else
        seq << tmp[idx]
      end
    end
    return self.class.new(seq)
  end

I haven't yet prepared my git environment, so I hope someone (Goto-san?) can
take care of the following patch.

Regards,
Toshiaki Katayama


% cvs diff -u lib/bio/sequence/common.rb
===========================================
 dev.open-bio.org - Authorized Access Only
===========================================
Index: lib/bio/sequence/common.rb
===================================================================
RCS file: /home/repository/bioruby/bioruby/lib/bio/sequence/common.rb,v
retrieving revision 1.6
diff -u -r1.6 common.rb

--- lib/bio/sequence/common.rb	27 Dec 2007 17:36:02 -0000	1.6
+++ lib/bio/sequence/common.rb	13 Oct 2008 20:50:27 -0000
@@ -241,28 +241,22 @@
   # * (optional) _hash_: Hash object
   # *Returns*:: new Bio::Sequence::NA/AA object
   def randomize(hash = nil)
-    length = self.length
     if hash
-      length = 0
-      count = hash.clone
-      count.each_value {|x| length += x}
+      tmp = ''
+      hash.each {|k, v|
+        tmp += k * v.to_i
+      }
     else
-      count = self.composition
+      tmp = self
     end
-
+    len = tmp.length - 1
     seq = ''
-    tmp = {}
-    length.times do 
-      count.each do |k, v|
-        tmp[k] = v * rand
-      end
-      max = tmp.max {|a, b| a[1] <=> b[1]}
-      count[max.first] -= 1
-
+    ary = (0..len).to_a.sort_by{rand}
+    ary.each do |idx|
       if block_given?
-        yield max.first
+        yield tmp[idx]
       else
-        seq += max.first
+        seq << tmp[idx]
       end
     end
     return self.class.new(seq)



On 2008/10/14, at 4:25, Anders Jacobsen wrote:

> Hi,
>
>
>
> I believe that the current sequence randomization/shuffle method is severely
> biased, infrequent bases are more likely to occur in the end of the sequence
> than in the beginning:
>
>
>
> class Array
>
>  #returns a histogram represented as a hash
>
>  def hist()
>
>    h = Hash.new(0)
>
>    self.each{|x| h[x] += 1}
>
>    h
>
>  end
>
> end
>
>
>
>>> (1..1000).to_a.map{|i|
> Bio::Sequence::NA.new("ccccggac").randomize.index("a") + 1}.hist.sort
>
> => [[1, 36], [2, 51], [3, 62], [4, 97], [5, 127], [6, 189], [7, 219], [8,
> 219]]
>
>
>
> I suggest implementing this method using the unbiased Fisher-Yates shuffle
> (http://en.wikipedia.org/wiki/Fisher-Yates_shuffle)
>
>
>
> class Array
>
>  def shuffle()
>
>    arr = self.dup
>
>    arr.size.downto 2 do |j|
>
>      r = Kernel::rand(j)
>
>      arr[j-1], arr[r] = arr[r], arr[j-1]
>
>    end
>
>    arr
>
>  end
>
> end
>
>
>
> (1..1000).to_a.map{|i|
> Bio::Sequence::NA.new("ccccggac").split("").shuffle.index("a") +
> 1}.hist.sort
>
> => [[1, 121], [2, 127], [3, 135], [4, 119], [5, 145], [6, 104], [7, 126],
> [8, 123]]
>
>
>
> -Anders Jacobsen
>
> _______________________________________________
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby