[Biopython-dev] Bio.Seq and alphabets
Michiel Jan Laurens de Hoon
mdehoon at ims.u-tokyo.ac.jp
Mon Jul 5 00:40:05 EDT 2004
I've been working on complement() and reverse_complement() functions for
Bio.Seq's Seq and MutableSeq classes. Previously, similar functions existed in
various places in Biopython. I am not sure, though, how to deal with the alphabet
associated with a Seq or MutableSeq object. For example, a Seq can be created
where the sequence is inconsistent with the alphabet:
>>> from Bio.Alphabet import IUPAC
>>> from Bio.Seq import Seq
>>> Seq('GATCGACXYSMDG_or_any_funny_char_u_like_eg_*&$%', IUPAC.unambiguous_dna)
With a MutableSeq, one can change the sequence regardless of the alphabet:
>>> from Bio.Seq import MutableSeq
>>> s = MutableSeq('ACTGCCATCGT', IUPAC.unambiguous_dna)
>>> s[-2] = 'X'
>>> s
MutableSeq(array('c', 'ACTGCCATCXT'), IUPACUnambiguousDNA())
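To make the gap concrete, here is a minimal sketch of the kind of check that would reject both examples above. The helper name check_letters and the hard-coded letter string are assumptions made for illustration; they are not part of Bio.Seq, and the sketch works on plain strings rather than Seq objects.

```python
# Hypothetical helper, not part of Biopython: verify that every character
# of a sequence belongs to the alphabet's letter set.
UNAMBIGUOUS_DNA_LETTERS = "GATC"  # assumed stand-in for the alphabet's letters


def check_letters(data, letters=UNAMBIGUOUS_DNA_LETTERS):
    """Return data unchanged, or raise ValueError on characters outside letters."""
    allowed = set(letters)
    bad = sorted(set(data) - allowed)
    if bad:
        raise ValueError("characters not in alphabet: %s" % "".join(bad))
    return data


check_letters("ACTGCCATCGT")  # passes silently
try:
    check_letters("ACTGCCATCXT")
except ValueError as err:
    print(err)  # the offending 'X' is reported
```

The same check could in principle be called from a constructor and from item assignment, at the cost of scanning the data on every change.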
Anyway, my immediate concern is how to deal with uppercase and lowercase
characters. The reverse_complement function in Bio.GFF.easy converts lowercase
characters to uppercase before taking the complement:
def _forward_complement_list_with_table(table, seq):
    return [table[x] for x in seq.tostring().upper()]
However, the complement and antiparallel functions in Bio.SeqUtils are not
implemented for lowercase sequences:
_before = ''.join(IUPACData.ambiguous_dna_complement.keys())
_after = ''.join(IUPACData.ambiguous_dna_complement.values())
_ttable = maketrans(_before, _after)

def complement(seq):
    """Returns the complementary sequence (NOT antiparallel).

    This works on string sequences, not on Bio.Seq objects.
    """
    #Much faster on really long sequences than the previous loop based one.
    #thx to Michael Palmer, University of Waterloo
    ...
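As an illustration of the lowercase problem, the sketch below builds an uppercase-only translation table, analogous to the one quoted above, and a second table extended with lowercase pairs. It is written for current Python (str.maketrans) and uses a complement dictionary spelled out by hand, so the names COMPLEMENT, upper_table and mixed_table are assumptions for this example and not Biopython identifiers.

```python
# Assumed complement pairs (IUPAC ambiguity codes), written out here so the
# example is self-contained; this is not read from Bio.Data.IUPACData.
COMPLEMENT = {
    "A": "T", "C": "G", "G": "C", "T": "A",
    "M": "K", "R": "Y", "W": "W", "S": "S",
    "Y": "R", "K": "M", "V": "B", "H": "D",
    "D": "H", "B": "V", "X": "X", "N": "N",
}

before = "".join(COMPLEMENT.keys())
after = "".join(COMPLEMENT.values())

# Uppercase-only table: characters without an entry pass through unchanged.
upper_table = str.maketrans(before, after)

# Table extended with lowercase pairs, so that case is preserved.
mixed_table = str.maketrans(before + before.lower(), after + after.lower())


def complement(seq, table=mixed_table):
    """Return the complement (not reversed) of a plain string."""
    return seq.translate(table)


def reverse_complement(seq, table=mixed_table):
    """Return the reverse complement of a plain string."""
    return complement(seq, table)[::-1]


print(complement("GATCgatc", upper_table))  # CTAGgatc (lowercase left untouched)
print(complement("GATCgatc"))               # CTAGctag
print(reverse_complement("GATCgatc"))       # gatcGATC
```

With the uppercase-only table, lowercase input is returned uncomplemented without any error, which is arguably worse than either converting to uppercase first or rejecting lowercase outright.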
So there are two issues we need to decide:
1) Should we modify the Seq and MutableSeq classes such that the sequence is
always consistent with the alphabet?
2) Should we allow lowercase characters in the sequence?
My own preference at this point is 1) yes 2) no, but I'd like to check what
others think.
Michiel de Hoon, Assistant Professor
University of Tokyo, Institute of Medical Science
Human Genome Center
4-6-1 Shirokane-dai, Minato-ku