[BioRuby] Biased Bio::Sequence randomize()
    Anders Jacobsen
    andersbj at binf.ku.dk
    Mon Oct 13 19:25:16 UTC 2008

Hi,
 
I believe that the current sequence randomization/shuffle method is severely
biased: infrequent bases are more likely to occur at the end of the sequence
than at the beginning. The histogram below counts the position at which the
single "a" lands over 1000 calls:
 
class Array
  #returns a histogram represented as a hash
  def hist()
    h = Hash.new(0)
    self.each{|x| h[x] += 1}
    h
  end
end
 
>> (1..1000).to_a.map{|i| Bio::Sequence::NA.new("ccccggac").randomize.index("a") + 1}.hist.sort
=> [[1, 36], [2, 51], [3, 62], [4, 97], [5, 127], [6, 189], [7, 219], [8, 219]]
 
I suggest reimplementing this method with the unbiased Fisher-Yates shuffle
(http://en.wikipedia.org/wiki/Fisher-Yates_shuffle):
 
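class Array
  # returns a shuffled copy; every permutation is equally likely
  def shuffle()
    arr = self.dup
    # walk from the last position down to the second
    arr.size.downto 2 do |j|
      # pick a uniform index in 0..j-1 and swap it into position j-1
      r = Kernel::rand(j)
      arr[j-1], arr[r] = arr[r], arr[j-1]
    end
    arr
  end
end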
 
>> (1..1000).to_a.map{|i| Bio::Sequence::NA.new("ccccggac").split("").shuffle.index("a") + 1}.hist.sort
=> [[1, 121], [2, 127], [3, 135], [4, 119], [5, 145], [6, 104], [7, 126], [8, 123]]
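
For anyone who wants to repeat the check without BioRuby installed, here is a
minimal standalone sketch of the same experiment on a plain string. The names
fisher_yates_shuffle and position_histogram are hypothetical helpers made up
for this illustration (they are not part of BioRuby), and the optional rng
argument is an assumption added only so that a run can be seeded and
reproduced.

```ruby
# Hypothetical helper, not part of BioRuby: Fisher-Yates shuffle that
# returns a shuffled copy and leaves the input untouched.
def fisher_yates_shuffle(items, rng = Random.new)
  arr = items.dup
  (arr.size - 1).downto(1) do |i|
    r = rng.rand(i + 1)               # uniform integer in 0..i
    arr[i], arr[r] = arr[r], arr[i]   # swap the chosen element into slot i
  end
  arr
end

# Hypothetical helper: counts the 1-based position at which the single "a"
# lands over the given number of shuffles.
def position_histogram(seq, trials, rng = Random.new)
  counts = Hash.new(0)
  bases = seq.chars
  trials.times do
    counts[fisher_yates_shuffle(bases, rng).index("a") + 1] += 1
  end
  counts
end

# With an unbiased shuffle each position should get roughly trials / 8 hits.
p position_histogram("ccccggac", 1000).sort
```

The helper is a standalone function rather than a method on Array, so it does
not replace any existing Array#shuffle.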
 
-Anders Jacobsen
    
    