[BioRuby] Count homopolymers

Raoul Jean Pierre Bonnal raoul.bonnal at itb.cnr.it
Tue Jan 15 11:35:10 UTC 2008


This is a raw way for counting homopolymers in a sequence.

homopolymersInfo = Hash.new{|hash_bases, key_base| hash_bases[key_base]
= Hash.new{|hash_types, key_type| hash_types[key_type] = Array.new;}}

"aggagagatgggggataaaaaatttttaaaaagggatagatgggggaattcccccaccccccaaaattttttgggggggggccccccaaaaa".scan(/a{3,}|c{3,}|g{3,}|t{3,}/i){|m| homopolymersInfo[$&.squeeze][$&.length] << [$~.begin(0),$~.end(0)]}

homopolymersInfo.each_pair{|base,types| types.each_pair{|type,positions|
puts "#{base}, #{type}, #{positions.size}"}}


a, 5, 2
a, 6, 1
a, 4, 1
c, 5, 1
c, 6, 2
g, 5, 2
g, 3, 1
g, 9, 1
t, 5, 1
t, 6, 1
  ==> {"a"=>{5=>[[27, 32], [87, 92]], 6=>[[16, 22]], 4=>[[62, 66]]},
"c"=>{5=>[[50, 55]], 6=>[[56, 62], [81, 87]]}, "g"=>{5=>[[9, 14], [41,
46]], 3=>[[32, 35]], 9=>[[72, 81]]}, "t"=>{5=>[[22, 27]], 6=>[[66,
72]]}}

Positions is an Array of start/end coordinates
I could use sym in hash, would be better.

--
Ra






More information about the BioRuby mailing list