[BioRuby] Translate ambiguous sequence

Thu Sep 11 01:51:43 UTC 2008

Hi,

BioRuby translates any codon containing an ambiguity code to the unknown
residue, "X" by default.
However, it is sometimes desirable to translate such a codon into a
specific amino acid when every possible reading encodes the same one, e.g.
tty -> "F"

The core of the implementation is
naseq[from, nalen].gsub(/.{3}/) {|codon| ct[codon] or unknown}
so changing unknown to ct.translate_ambiguity(codon, unknown)
will not hurt performance for sequences without ambiguity codes, because
the new method is called only when the table lookup fails,
and I think resolving degenerate codons is worth doing.
Also, sequences in GenBank are usually translated this way.
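
To illustrate the intended behaviour outside BioRuby, here is a minimal
standalone sketch. The names CODON_TABLE, IUPAC and resolve_codon are made
up for this example and are not part of BioRuby; the table is the standard
genetic code written in the usual t, c, a, g order, and only DNA letters
are handled. It is only meant to show the rule: expand each ambiguity code,
look up every resulting codon, and return an amino acid only when all of
them agree.

```ruby
# Standalone sketch; none of these names are BioRuby API.
BASES = %w[t c a g].freeze

# Standard genetic code, one letter per codon, third base varying fastest.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Build "ttt" => "F", "ttc" => "F", ... from the string above.
CODON_TABLE = {}
BASES.product(BASES, BASES).each_with_index do |(b1, b2, b3), i|
  CODON_TABLE[b1 + b2 + b3] = AMINO_ACIDS[i]
end

# IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
  "a" => "a", "c" => "c", "g" => "g", "t" => "t",
  "r" => "ag", "y" => "ct", "w" => "at", "s" => "cg",
  "k" => "gt", "m" => "ac", "b" => "cgt", "d" => "agt",
  "h" => "act", "v" => "acg", "n" => "acgt"
}.freeze

# Return the amino acid shared by every expansion of the codon,
# or the unknown symbol when the expansions disagree.
def resolve_codon(codon, unknown = "X")
  expansions = codon.downcase.chars.map { |base| IUPAC[base] }
  return unknown if codon.length != 3 || expansions.any?(&:nil?)

  first, second, third = expansions.map(&:chars)
  amino_acids = first.product(second, third)
                     .map { |bases| CODON_TABLE[bases.join] }
                     .uniq
  amino_acids.length == 1 ? amino_acids.first : unknown
end
```

With this sketch, resolve_codon("tty") gives "F", while resolve_codon("atn")
gives "X" because atg codes for M and the other three readings code for I.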

What do you think?

diff -ru bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/data/codontable.rb bioruby-c/lib/bio/data/codontable.rb

--- bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/data/codontable.rb     2008-09-03 22:24:39.000000000 +0900
+++ bioruby-c/lib/bio/data/codontable.rb        2008-09-11 09:49:23.000000000 +0900
@@ -93,6 +93,23 @@
    def [](codon)
      @table[codon]
    end
+  def translate_ambiguity(codon, unknown = 'X')
+    triplet = codon + "NNN"
+    aa = nil
+    Bio::NucleicAcid.ambiguity2individual(triplet[2..2]).each do|third|
+      Bio::NucleicAcid.ambiguity2individual(triplet[0..0]).each do|first|
+        Bio::NucleicAcid.ambiguity2individual(triplet[1..1]).each do|second|
+          if aa == nil
+            aa = @table[first+second+third]
+          elsif
+            aa != @table[first+second+third]
+            return unknown
+          end
+        end
+      end
+    end
+    aa
+  end

    # Modify the codon table.  Use with caution as it may break hard coded
    # tables.  If you want to modify existing table, you should use copy
diff -ru bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/data/na.rb bioruby-c/lib/bio/data/na.rb
--- bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/data/na.rb 2008-09-03 22:24:39.000000000 +0900
+++ bioruby-c/lib/bio/data/na.rb        2008-09-11 09:26:00.000000000 +0900
@@ -182,6 +182,13 @@
        end
        Regexp.new(str)
      end
+    def ambiguity2individual(na, rna = false)
+      str = NAMES[na.downcase].gsub(/[\[\]]/,"")
+      if rna
+        str.tr!("t", "u")
+      end
+      str.split(//)
+    end

    end

diff -ru bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/sequence/na.rb bioruby-c/lib/bio/sequence/na.rb
--- bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/sequence/na.rb 2008-09-03 22:24:39.000000000 +0900
+++ bioruby-c/lib/bio/sequence/na.rb    2008-09-11 09:48:52.000000000 +0900
@@ -252,7 +252,7 @@
      end
      nalen = naseq.length - from
      nalen -= nalen % 3
-    aaseq = naseq[from, nalen].gsub(/.{3}/) {|codon| ct[codon] or unknown}
+    aaseq = naseq[from, nalen].gsub(/.{3}/) {|codon| ct[codon] or ct.translate_ambiguity(codon, unknown)}
      return Bio::Sequence::AA.new(aaseq)
    end

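The point about performance rests on "or" short-circuiting: the right-hand
side runs only for codons that are missing from the table. The sketch below
shows this with a deliberately tiny, hypothetical table (TINY_TABLE) and a
counter standing in for ct.translate_ambiguity; the method name
translate_counting_fallback is invented for this example and is not part of
the patch.

```ruby
# Hypothetical three-entry table; a real codon table has 64 entries.
TINY_TABLE = { "atg" => "M", "ttt" => "F", "ttc" => "F" }.freeze

# Translate with the same gsub pattern as the patch and count how often
# the fallback branch is actually evaluated.
def translate_counting_fallback(naseq, table)
  fallback_calls = 0
  aaseq = naseq.gsub(/.{3}/) do |codon|
    table[codon] or begin
      fallback_calls += 1
      # Stand-in for ct.translate_ambiguity: resolves "tty" by hand only.
      codon == "tty" ? "F" : "X"
    end
  end
  [aaseq, fallback_calls]
end
```

For an unambiguous input such as "atgtttttc" the counter stays at zero, so
the extra cost is paid only by codons that would otherwise have become "X".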
-- 
Tomoaki NISHIYAMA

Advanced Science Research Center,
Kanazawa University,
13-1 Takara-machi,
Kanazawa, 920-0934, Japan