[Biopython-dev] Bio.Seq and alphabets

Mon Jul 5 00:40:05 EDT 2004

I've been working on complement() and reverse_complement() functions for 
Bio.Seq's Seq and MutableSeq classes. Previously, similar functions existed in 
various places in Biopython. I am not sure, though, how to deal with the alphabet 
associated with a Seq or MutableSeq object. For example, a Seq can be created 
with a sequence that is inconsistent with its alphabet:

 >>> from Bio.Alphabet import IUPAC
 >>> from Bio.Seq import Seq
 >>> Seq('GATCGACXYSMDG_or_any_funny_char_u_like_eg_*&$%', IUPAC.unambiguous_dna)
Seq('GATCGACXYSMDG_or_any_funny_char_u_like_eg_*&$%', IUPACUnambiguousDNA())

With a MutableSeq, one can change the sequence regardless of the alphabet:

 >>> from Bio.Seq import MutableSeq
 >>> s = MutableSeq('ACTGCCATCGT', IUPAC.unambiguous_dna)
 >>> s[9] = 'X'
 >>> s
MutableSeq(array('c', 'ACTGCCATCXT'), IUPACUnambiguousDNA())

Anyway, my immediate concern is how to deal with uppercase and lowercase 
characters. The reverse_complement function in Bio.GFF.easy converts lowercase 
characters to uppercase before taking the complement:

def _forward_complement_list_with_table(table, seq):
     return [table[x] for x in seq.tostring().upper()]

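However, the complement and antiparallel functions in Bio.SeqUtils do not 
handle lowercase sequences. Their translation table is built from uppercase 
letters only, so lowercase characters pass through unchanged: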

# Imports this excerpt relies on
from string import maketrans
from Bio.Data import IUPACData

_before = ''.join(IUPACData.ambiguous_dna_complement.keys())
_after = ''.join(IUPACData.ambiguous_dna_complement.values())
_ttable = maketrans(_before, _after)

def complement(seq):
     """Returns the complementary sequence (NOT antiparallel).

     This works on string sequences, not on Bio.Seq objects.
     """
     #Much faster on really long sequences than the previous loop based one.
     #thx to Michael Palmer, University of Waterloo
     return seq.translate(_ttable)
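
For comparison, here is a minimal sketch of how a translate-based complement could keep lowercase characters instead of ignoring them or forcing them to uppercase. It is not existing Biopython code. It assumes Python 3's str.maketrans (the excerpt above uses string.maketrans), the mapping is written out by hand rather than taken from IUPACData, and the function names are made up for illustration.

```python
# Hand-written copy of an IUPAC ambiguous DNA complement mapping
# (uppercase only); an assumption for this sketch, not read from Biopython.
AMBIGUOUS_DNA_COMPLEMENT = {
    "A": "T", "C": "G", "G": "C", "T": "A",
    "M": "K", "R": "Y", "W": "W", "S": "S",
    "Y": "R", "K": "M", "V": "B", "H": "D",
    "D": "H", "B": "V", "X": "X", "N": "N",
}


def build_case_preserving_table(mapping):
    """Build a translation table covering upper- and lowercase letters."""
    before = "".join(mapping.keys())
    after = "".join(mapping.values())
    # Append the lowercase versions so 'a' -> 't', 'c' -> 'g', and so on.
    return str.maketrans(before + before.lower(), after + after.lower())


_TABLE = build_case_preserving_table(AMBIGUOUS_DNA_COMPLEMENT)


def complement_keep_case(seq):
    """Complement a plain string, keeping the case of each character."""
    return seq.translate(_TABLE)


def reverse_complement_keep_case(seq):
    """Complement, then reverse, a plain string."""
    return complement_keep_case(seq)[::-1]
```

Characters that are not in the table (digits, '*', and so on) are still passed through unchanged, so this only addresses case, not consistency with the alphabet.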

So there are two issues we need to decide:

1) Should we modify the Seq and MutableSeq classes such that the sequence is 
always consistent with the alphabet?

2) Should we allow lowercase characters in the sequence?

My own preference at this point is 1) yes 2) no, but I'd like to check what 
y'all think.
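
To make the "1) yes 2) no" option concrete, here is a rough sketch of a mutable sequence that refuses characters outside its alphabet. The CheckedSeq class and the letters argument are hypothetical and do not exist in Bio.Seq. The sketch is written for Python 3 and only handles single-character assignment.

```python
# Letters of an unambiguous DNA alphabet, uppercase only (assumed for this sketch).
UNAMBIGUOUS_DNA_LETTERS = "GATC"


class CheckedSeq:
    """Hypothetical mutable sequence that stays consistent with its alphabet."""

    def __init__(self, data, letters=UNAMBIGUOUS_DNA_LETTERS):
        self.letters = letters
        self._check(data)
        self.data = list(data)

    def _check(self, chars):
        # Collect every character that the alphabet does not allow.
        bad = sorted(set(chars) - set(self.letters))
        if bad:
            raise ValueError("characters not in alphabet: %s" % "".join(bad))

    def __setitem__(self, index, value):
        # This sketch supports replacing one position with one character.
        if len(value) != 1:
            raise ValueError("expected a single character")
        self._check(value)
        self.data[index] = value

    def __str__(self):
        return "".join(self.data)
```

With this, CheckedSeq('ACTGCCATCGT') is accepted, while assigning 'X' to position 9 or constructing CheckedSeq('gatc') raises ValueError, because neither 'X' nor lowercase letters are in the alphabet.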

--Michiel.

-- 
Michiel de Hoon, Assistant Professor
University of Tokyo, Institute of Medical Science
Human Genome Center
4-6-1 Shirokane-dai, Minato-ku
Tokyo 108-8639
Japan
http://bonsai.ims.u-tokyo.ac.jp/~mdehoon