[Biopython-dev] forward_complement, reverse_complement

Michiel Jan Laurens de Hoon mdehoon at ims.u-tokyo.ac.jp
Sun Jun 27 23:38:14 EDT 2004


I timed the complement, reverse, and reverse_complement functions in
Bio.SeqUtils, Bio.GFF.easy, and Bio.Seq. It turned out that reverse_complement
and forward_complement in Bio.GFF.easy are faster than their counterparts in
Bio.SeqUtils. However, an implementation based on the built-in map function is
faster still:

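     def reverse_complement(self):
         # Relies on the array module being imported at module level.
         from Bio.Data.IUPACData import ambiguous_dna_complement
         # Complement each base via the IUPAC lookup table (map returns a list).
         self.data = map(lambda c: ambiguous_dna_complement[c], self.data)
         # Reverse the list in place, then convert it back to a character array.
         self.data.reverse()
         self.data = array.array('c', self.data)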

Here, I implemented reverse_complement as a member function of MutableSeq. I
feel that is the best place for this function, since MutableSeq already has a
member function "reverse". SeqUtils mainly contains functions that analyze
sequences without modifying them.
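
For readers on a current Python, the same map-based idea can be sketched
without Biopython. This is an illustration only, under stated assumptions: the
COMPLEMENT table is a hypothetical stand-in for ambiguous_dna_complement that
covers just the four unambiguous bases, reverse_complement_inplace is a made-up
name, and a bytearray stands in for the character array held by MutableSeq.

```python
# Hypothetical stand-in for Bio.Data.IUPACData.ambiguous_dna_complement,
# restricted to the four unambiguous upper-case bases.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement_inplace(data):
    """Reverse-complement a bytearray of ASCII bases in place."""
    # Complement each base through the lookup table. map() is lazy in
    # Python 3, so materialise it as a list before reversing.
    complemented = list(map(lambda c: COMPLEMENT[chr(c)], data))
    complemented.reverse()
    # Write the result back into the same buffer, mirroring the way the
    # member function above modifies the sequence instead of returning a copy.
    data[:] = "".join(complemented).encode("ascii")


seq = bytearray(b"AACGT")
reverse_complement_inplace(seq)
print(seq.decode("ascii"))
```

Any base missing from the table raises a KeyError, so a real implementation
would need the full ambiguous table.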

The timing results are below. Note that the functions in Bio.SeqUtils can handle
both strings and Seq objects, with the Seq objects being slower, while
Bio.GFF.easy and Bio.Seq handle Seq objects only.
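
The post does not say how the timings were collected. One possible way to take
comparable measurements on a current Python is sketched below with the standard
timeit module. The function complement_with_map, its COMPLEMENT table, and the
helper time_call are all hypothetical names used for illustration, and the
sequence lengths are kept small so that the sketch runs quickly.

```python
import timeit

# Hypothetical lookup table covering only the four unambiguous bases.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def complement_with_map(seq):
    """Return the complement of a plain string, using map for the lookup."""
    return "".join(map(COMPLEMENT.__getitem__, seq))


def time_call(func, seq, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` single runs."""
    return min(timeit.repeat(lambda: func(seq), number=1, repeat=repeats))


for length in (1000, 10000, 100000):
    # Build a test sequence of the requested length from a repeated motif.
    seq = "ACGT" * (length // 4)
    print("%12d  %.3f" % (length, time_call(complement_with_map, seq)))
```

Taking the minimum over several runs reduces noise from other processes. The
absolute numbers will not match the tables in this post, which were measured on
2004 hardware and software.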

Can I go ahead and update CVS to add complement and reverse_complement to
Bio.Seq? I'll clean up Bio.GFF.easy and Bio.SeqUtils accordingly.

--Michiel.

Timings  (in seconds)
=====================

                     Bio.GFF.easy            Bio.SeqUtils         Using map
                   reverse_complement        antiparallel     reverse_complement
Sequence length       Seq object        Seq object   string      Seq object
       1 000             0.002             0.004       0.002         0.002
      10 000             0.017             0.045       0.023         0.012
     100 000             0.166             0.444       0.225         0.117
   1 000 000             1.651             4.347       2.234         1.135
  10 000 000            18.187            45.137      24.179        11.697
100 000 000           192.243           457.680     242.258       116.170

                     Bio.GFF.easy            Bio.SeqUtils         Using map
                   forward_complement         complement          complement
Sequence length       Seq object        Seq object   string      Seq object
       1 000             0.002             0.005       0.002         0.001
      10 000             0.016             0.042       0.020         0.012
     100 000             0.165             0.435       0.192         0.119
   1 000 000             1.638             4.283       1.912         1.166
  10 000 000            17.993            45.085      20.937        11.572
100 000 000           193.528           443.024     209.573       116.916

                                             Bio.SeqUtils          Bio.Seq
                                                reverse            reverse
Sequence length                         Seq object   string      Seq object
       1 000                               0.003       0.001         0.000
      10 000                               0.023       0.003         0.001
     100 000                               0.226       0.022         0.010
   1 000 000                               2.232       0.227         0.107
  10 000 000                              22.592       2.319         1.057
100 000 000                             225.447      23.094        10.559

Michael Hoffman wrote:
>>Bio/GFF/easy.py contains the functions forward_complement and
>>reverse_complement, which return the forward and reverse complement of a
>>sequence object. I had been looking for such functions in Biopython for a while,
>>but I assumed that they were not available as I didn't find them in Bio/Seq.py.
>>I'd like to propose to move those two functions there. Note that Bio.SeqUtils
>>contains similar functions that work on strings but not on sequence objects. Any
>>thoughts?
> 
> 
> I wrote those when Bio.GFF was not part of Biopython and they are
> really only there to support Bio.GFF.
> 
> It would probably be better to change the Bio.SeqUtils functions to
> work on sequence objects. I imagine the Bio.SeqUtils functions are
> much faster since much of the work gets passed to the native function
> str.translate().
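
The str.translate() approach mentioned in the quoted message can be sketched as
follows on a current Python. This is not the Bio.SeqUtils code: the table
COMPLEMENT_TABLE and both function names are assumptions made for illustration,
and only the unambiguous bases are mapped.

```python
# Hypothetical translation table for the unambiguous bases, in both cases.
# Characters that are not in the table (for example "N") pass through unchanged.
COMPLEMENT_TABLE = str.maketrans("ACGTacgt", "TGCAtgca")


def complement(seq):
    """Return the complement of a plain string via str.translate."""
    return seq.translate(COMPLEMENT_TABLE)


def reverse_complement(seq):
    """Return the reverse complement of a plain string."""
    # Slicing with a step of -1 reverses the translated string.
    return complement(seq)[::-1]


print(complement("AACGT"))
print(reverse_complement("AACGT"))
```

These functions take and return plain strings, so a wrapper would still be
needed to accept sequence objects.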

-- 
Michiel de Hoon, Assistant Professor
University of Tokyo, Institute of Medical Science
Human Genome Center
4-6-1 Shirokane-dai, Minato-ku
Tokyo 108-8639
Japan
http://bonsai.ims.u-tokyo.ac.jp/~mdehoon




