[BioPython] Rethinking Seq objects
Michiel Jan Laurens de Hoon
mdehoon at ims.u-tokyo.ac.jp
Tue May 3 02:45:00 EDT 2005
Hi everybody,
Recently, there was a discussion on biopython-dev about changes to the Seq and
MutableSeq classes. I'd like to ask whether any of the proposed changes would
cause you any problems.
The current proposal is:
1) Make Seq objects mutable, and get rid of MutableSeq. The Seq class and the
MutableSeq class basically describe the same thing, except that one is read-only
and the other one is not. If desired, we can add a readonly flag to the class to
describe if it is mutable or not. (Given that e.g. Numerical Python arrays don't
have such a flag, my feeling is that it is not really needed for Seq objects
either). For performance reasons, the new Seq class will be implemented in C.
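To make the idea of a single mutable class with an optional readonly flag
concrete, here is a minimal pure-Python sketch. The class name SketchSeq and the
readonly keyword are assumptions for illustration only; nothing like this exists
in Biopython yet, and the real class is meant to be written in C.

```python
class SketchSeq:
    """Hypothetical mutable sequence with an optional read-only flag."""

    def __init__(self, data, readonly=False):
        # Store the letters in a list so they can be edited in place.
        self._data = list(data)
        self.readonly = readonly

    def __setitem__(self, index, value):
        if self.readonly:
            raise TypeError("this sequence is read-only")
        if isinstance(index, slice):
            self._data[index] = list(value)
        else:
            if len(value) != 1:
                raise ValueError("a single position takes a single letter")
            self._data[index] = value

    def __getitem__(self, index):
        if isinstance(index, slice):
            # A slice is a new sequence that inherits the flag.
            return SketchSeq(self._data[index], self.readonly)
        return self._data[index]

    def __len__(self):
        return len(self._data)

    def __str__(self):
        return "".join(self._data)


s = SketchSeq("ATCG")
s[0] = "G"      # in-place edit, no separate MutableSeq needed
s[1:3] = "AA"   # slice assignment works the same way
print(s)
```

With readonly=True, the same assignments raise a TypeError instead.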
2) By default, a Seq object doesn't assume a particular alphabet. This is the
same as the current behavior:
>>> from Bio.Seq import *
>>> Seq('ATCG')
Seq('ATCG', Alphabet())
However, if the user decides to specify the alphabet explicitly, input to the
sequence will be checked for consistency with the alphabet. So
>>> from Bio.Seq import *
>>> from Bio.Alphabet import IUPAC
>>> my_alpha = IUPAC.unambiguous_dna
>>> s = Seq('GATCGATGGGCCTATTAGGATCGAAAATCGC', my_alpha)
>>> s[:3] = "XYZ"
will raise an error.
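The consistency check could work roughly as follows. This is only a sketch under
assumptions: CheckedSeq is a hypothetical name, a plain string of allowed
letters stands in for an alphabet object, and ValueError is just one possible
choice of exception.

```python
class CheckedSeq:
    """Hypothetical sequence that validates input against an optional alphabet."""

    def __init__(self, data, letters=None):
        # letters=None means "no particular alphabet", so nothing is checked.
        self.letters = letters
        self._check(data)
        self._data = list(data)

    def _check(self, data):
        if self.letters is None:
            return
        bad = sorted(set(data) - set(self.letters))
        if bad:
            raise ValueError("letters not in alphabet: " + "".join(bad))

    def __setitem__(self, index, value):
        # Validate before modifying, so a failed assignment changes nothing.
        self._check(value)
        if isinstance(index, slice):
            self._data[index] = list(value)
        else:
            self._data[index] = value

    def __str__(self):
        return "".join(self._data)


s = CheckedSeq("GATCGATGGGCC", letters="GATC")
s[:3] = "TTT"          # accepted: all letters are in the alphabet
try:
    s[:3] = "XYZ"      # rejected: X, Y and Z are not DNA letters
except ValueError as err:
    print(err)
```

Without an explicit alphabet, CheckedSeq("XYZ") is accepted as is, which matches
the default behavior described above.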
3) Make Seq objects understand circular genomes. Many bacterial genomes are
circular. If a Seq object is circular, it would be nice to be able to take the
slice [-1000:1000] from it, or equivalently [3999000:4001000] from a circular
sequence of length 4000000.
Circular genomes will likely be implemented as an optional keyword (perhaps
"topology") when creating the Seq object, with corresponding set_topology,
get_topology methods.
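The intended slicing behavior for a circular sequence can be sketched with
modular indexing. The function circular_slice below is a hypothetical helper
written for illustration, not a proposed API, and it favors clarity over speed.

```python
def circular_slice(data, start, stop):
    """Return data[start:stop] on a circular sequence, wrapping at the origin.

    start may be negative and stop may exceed len(data); the result has
    stop - start letters.
    """
    n = len(data)
    if n == 0 or stop <= start:
        return ""
    # Python's % always returns a non-negative result for a positive modulus,
    # so negative positions wrap around to the end of the sequence.
    return "".join(data[i % n] for i in range(start, stop))


genome = "ABCDEFGHIJ"                   # toy circular "genome" of length 10
print(circular_slice(genome, -2, 3))    # two letters before the origin, three after
print(circular_slice(genome, 8, 13))    # the same region, counted from the other side
```

Both calls return the same five letters, just as [-1000:1000] and the
corresponding slice past the end would on a real circular genome.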
4) Perhaps it would be a good idea to add transcribe and translate methods to
the Seq class. Currently, to translate a DNA sequence, we have to do
>>> from Bio.Seq import Seq
>>> from Bio import Translate
>>> from Bio.Alphabet import IUPAC
>>> my_alpha = IUPAC.unambiguous_dna
>>> my_seq = Seq('GATCGATGGGCCTATTAGGATCGAAAATCGC', my_alpha)
>>> standard_translator = Translate.unambiguous_dna_by_id[1]
>>> standard_translator.translate(my_seq)
Seq('DRWAY*DRKS', IUPACProtein())
which is too much typing for my taste.
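For comparison, here is a sketch of how transcribe and translate methods might
feel in use. Everything in it is an assumption for illustration: the class name
TranslatableSeq is hypothetical, the methods work on plain strings, only the
standard genetic code is built in, and trailing letters that do not form a
complete codon are silently dropped.

```python
# Standard genetic code (NCBI table 1), with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO_ACIDS))


class TranslatableSeq:
    """Hypothetical sequence class with transcribe and translate methods."""

    def __init__(self, data):
        self.data = data

    def transcribe(self):
        # DNA to RNA: replace thymine with uracil.
        return TranslatableSeq(self.data.replace("T", "U"))

    def translate(self):
        # Accept RNA as well by mapping U back to T before the lookup.
        dna = self.data.replace("U", "T")
        usable = len(dna) - len(dna) % 3   # ignore an incomplete final codon
        return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


my_seq = TranslatableSeq("ATGGCCTAA")
print(my_seq.transcribe().data)   # AUGGCCUAA
print(my_seq.translate())         # MA*
```

The point is only that the common case becomes a single method call; how to
select other codon tables is left open.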
Questions/comments/suggestions are welcome. None of this has actually been coded
yet, so it's all still open to discussion.
--Michiel.
--
Michiel de Hoon, Assistant Professor
University of Tokyo, Institute of Medical Science
Human Genome Center
4-6-1 Shirokane-dai, Minato-ku
Tokyo 108-8639
Japan
http://bonsai.ims.u-tokyo.ac.jp/~mdehoon