[BioRuby] Biased Bio::Sequence randomize()

Mon Oct 13 19:25:16 UTC 2008

Hi,

 

I believe that the current sequence randomization/shuffle method is severely
biased: infrequent bases are more likely to end up near the end of the
sequence than near the beginning. The test below tallies where the single "a"
of "ccccggac" lands over 1000 runs:

 

class Array
  # Returns a histogram of the elements as a hash (element => count).
  def hist
    h = Hash.new(0)
    each { |x| h[x] += 1 }
    h
  end
end

 

>> (1..1000).to_a.map{|i|
     Bio::Sequence::NA.new("ccccggac").randomize.index("a") + 1}.hist.sort
=> [[1, 36], [2, 51], [3, 62], [4, 97], [5, 127], [6, 189], [7, 219], [8, 219]]
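
To put a number on the skew, here is a minimal sketch that runs a Pearson chi-square test on the counts from the histogram above against a uniform expectation (125 per position). The helper name chi_square_uniform is my own and is not part of BioRuby.

```ruby
# Counts copied from the histogram above: position of "a" => occurrences.
observed = { 1 => 36, 2 => 51, 3 => 62, 4 => 97,
             5 => 127, 6 => 189, 7 => 219, 8 => 219 }

# Hypothetical helper: Pearson chi-square statistic against the assumption
# that every position is equally likely.
def chi_square_uniform(counts)
  total    = counts.values.inject(0) { |sum, c| sum + c }
  expected = total.to_f / counts.size
  counts.values.inject(0.0) { |sum, c| sum + (c - expected)**2 / expected }
end

stat = chi_square_uniform(observed)
puts format("chi-square = %.1f", stat)   # prints "chi-square = 319.4"
```

With 8 positions there are 7 degrees of freedom, for which the 5% critical value is about 14.07, so a statistic of roughly 319 is far outside what chance alone would explain.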

 

I suggest reimplementing randomize using the unbiased Fisher-Yates shuffle
(http://en.wikipedia.org/wiki/Fisher-Yates_shuffle):

 

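class Array
  # Returns a shuffled copy of the array (Fisher-Yates); self is not modified.
  def shuffle
    arr = dup
    # Walk from the last slot down to the second, swapping each slot with a
    # uniformly chosen slot at or before it.
    arr.size.downto(2) do |j|
      r = Kernel.rand(j)   # random index in 0...j
      arr[j - 1], arr[r] = arr[r], arr[j - 1]
    end
    arr
  end
end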

 

>> (1..1000).to_a.map{|i|
     Bio::Sequence::NA.new("ccccggac").split("").shuffle.index("a") + 1}.hist.sort
=> [[1, 121], [2, 127], [3, 135], [4, 119], [5, 145], [6, 104], [7, 126], [8, 123]]
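
For anyone who wants to repeat the check without BioRuby installed, the following is a rough stand-alone sketch on plain Ruby strings. The function names fisher_yates and a_position_counts, the seed, and the trial count are all my own choices for illustration; passing in a seeded Random object just makes the run repeatable.

```ruby
# Hypothetical stand-alone helper (not part of BioRuby): Fisher-Yates shuffle
# of a copy of +items+, drawing random numbers from the supplied generator.
def fisher_yates(items, rng)
  arr = items.dup
  (arr.size - 1).downto(1) do |i|
    j = rng.rand(i + 1)              # uniform index with 0 <= j <= i
    arr[i], arr[j] = arr[j], arr[i]
  end
  arr
end

# Tallies the 1-based position of "a" over +trials+ shuffles of +seq+ and
# returns sorted [position, count] pairs, like hist.sort in the post.
def a_position_counts(seq, trials, rng)
  counts = Hash.new(0)
  trials.times do
    counts[fisher_yates(seq.chars.to_a, rng).index("a") + 1] += 1
  end
  counts.sort
end

rng = Random.new(2008)               # arbitrary seed, only for repeatability
p a_position_counts("ccccggac", 8000, rng)
```

With an unbiased shuffle each of the 8 positions should come out close to trials / 8, which is the pattern in the second histogram above.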

 

-Anders Jacobsen