[Biopython-dev] Rethinking Seq objects

Wed Apr 27 23:29:33 EDT 2005

Michael Hoffman wrote:

> On Wed, 27 Apr 2005, Michiel Jan Laurens de Hoon wrote:
> 
>> 1) Make Seq objects mutable, and get rid of MutableSeq.
> 
> I imagine it will be a lot slower to replace built-in strings with
> character arrays. Right now, I only use Seq when I absolutely have to.

Well, I wouldn't replace them with character arrays; the idea would be to
reimplement the Seq class in C. So it would not be slower than built-in strings,
and maybe even a bit faster. The Seq object would look like a string object, but
be mutable.

> Similarly, I think lots of magic trying to figure out the alphabet is
> a bad idea. There are only a few operations that actually require the
> alphabet to be known, and most of the time I store a sequence in
> memory I'm not going to need any of these, so having to deal with
> alphabet issues when it's unnecessary is just going to be a pain in
> the butt that will keep me from using Seq. Similarly, I use augmented
> alphabets with things like B in them and I don't want Seq yelling at
> me when there's no point. Sure, if it can't figure out how to revcom
> the sequence, but just to instantiate it?

OK, then how about this:
- By default, don't assume a particular alphabet. Same as how it works now:
 >>> from Bio.Seq import *
 >>> Seq('ATCG')
Seq('ATCG', Alphabet())
- If the user decides to specify the alphabet, make sure the sequence is 
consistent with it. Of course, if the alphabet is Alphabet(), don't do any input 
checking. So essentially, the user gets to decide whether she wants input 
checking for the sequence or not.
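
A rough sketch of the opt-in checking described above. The names here (CheckedSeq, the letters argument) are made up for illustration and are not part of the current Bio.Seq API; passing no letters plays the role of the generic Alphabet().

```python
class CheckedSeq:
    """Hypothetical sketch, not the existing Bio.Seq.Seq class."""

    def __init__(self, data, letters=None):
        # letters=None stands in for the generic Alphabet(): no checking.
        if letters is not None:
            bad = sorted(set(data) - set(letters))
            if bad:
                raise ValueError("letters not in alphabet: " + "".join(bad))
        self.data = data
        self.letters = letters


CheckedSeq("ATCGB")                  # accepted: no alphabet given
CheckedSeq("ATCG", letters="ACGT")   # accepted: consistent with the alphabet
try:
    CheckedSeq("ATCGB", letters="ACGT")
except ValueError as err:
    print(err)                       # prints: letters not in alphabet: B
```

Someone who uses an augmented alphabet with B in it would either pass no alphabet at all or pass their own set of letters.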

>> >>> Right now, we can do
>> >>>  from Bio.Seq import *
>> >>>  from Bio.Alphabet import IUPAC
>> >>>  my_alpha = IUPAC.unambiguous_dna
>> >>>  my_seq = MutableSeq('GATCGATGGGCCTATTAGGATCGAAAATCGC', my_alpha)
>> >>>  my_seq[:10] = "weirdstuff"
>> >>>  my_seq
>> MutableSeq(array('c', 'weirdstuffCCTATTAGGATCGAAAATCGC'), 
>> IUPACUnambiguousDNA())
> 
> "Doctor, it hurts when I do this."
> "Don't do that."

Well, you would be right if this were Biofortran. For a higher-level language, I
would expect better checking to make sure an object is self-consistent. Python
itself is full of checks and assertions.

Another option would be to get rid of alphabets altogether. If they don't
enforce consistency, what good are they?
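
To illustrate the kind of self-consistency check I have in mind for the "weirdstuff" case, here is a minimal sketch. CheckedMutableSeq and its letters argument are hypothetical; this is not how MutableSeq behaves today.

```python
class CheckedMutableSeq:
    """Hypothetical sketch; not the existing MutableSeq."""

    def __init__(self, data, letters=None):
        self.letters = letters      # None means: no alphabet, no checking
        self._check(data)
        self.data = list(data)

    def _check(self, chunk):
        if self.letters is not None and not set(chunk) <= set(self.letters):
            raise ValueError("%r is not consistent with the alphabet" % (chunk,))

    def __setitem__(self, index, value):
        self._check(value)          # validate before touching the data
        if isinstance(index, slice):
            self.data[index] = list(value)
        else:
            if len(value) != 1:
                raise ValueError("expected a single letter")
            self.data[index] = value

    def __str__(self):
        return "".join(self.data)


my_seq = CheckedMutableSeq("GATCGATGGGCC", letters="ACGT")
my_seq[:3] = "TTT"                  # fine: all letters are in the alphabet
try:
    my_seq[:10] = "weirdstuff"      # rejected; my_seq is left unchanged
except ValueError as err:
    print(err)
```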

>> 4) Make Seq objects understand circular genomes. Many bacterial 
>> genomes are circular. It would be nice if we could take the indices 
>> [-1000:1000] from a Seq object, if it is circular, or 
>> [3999000:4001000] if the sequence is circular with length 4000000.
> 
> I'm sure that will be useful to some people. But having a CircularSeq
> subclass would make it easier to avoid this extra functionality from
> impacting on the primary use case.

My feeling is that having a subclass is a bit of overkill. The idea is to
have an optional topology argument, which defaults to "linear". So the primary
use case would not be affected.
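
A small sketch of how an optional topology argument could work. TopoSeq is a made-up name, and this version only wraps plain [start:stop] slices; anything else falls back to ordinary string indexing.

```python
class TopoSeq:
    """Hypothetical sketch of a Seq with an optional topology argument."""

    def __init__(self, data, topology="linear"):
        if topology not in ("linear", "circular"):
            raise ValueError("topology must be 'linear' or 'circular'")
        self.data = data
        self.topology = topology

    def __getitem__(self, index):
        wraps = (
            self.topology == "circular"
            and isinstance(index, slice)
            and index.start is not None
            and index.stop is not None
            and index.step in (None, 1)
        )
        if not wraps:
            return self.data[index]      # ordinary string behaviour
        n = len(self.data)
        # Walk from start to stop, wrapping around the origin as needed.
        return "".join(self.data[i % n] for i in range(index.start, index.stop))


genome = TopoSeq("ABCDEFGH", topology="circular")
print(genome[-2:2])                      # GHAB
print(genome[6:10])                      # GHAB, the same region
print(TopoSeq("ABCDEFGH")[-2:2])         # empty string, as for a plain str
```

With the default topology="linear", indexing goes straight to the underlying string, so users who never ask for a circular sequence see no change in behaviour.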

> 
>> 5) Perhaps it would be a good idea to add transcribe and translate 
>> methods to the Seq class.
> 
> +1
> 
> You would obviously have to specify an alphabet for this, but I'm fine
> with that so long as I'm not forced to when I don't need to.

If the alphabet defaults to Alphabet() when creating a Seq object, then I'd 
think the transcribe and translate methods should work even if a user doesn't 
specify the sequence to be DNA or RNA. My current gripe with the Seq object is 
that there are too many steps to translate a DNA sequence.
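
Something along these lines is what I would like to be able to write. SimpleSeq is a hypothetical stand-in, not the current Seq class; it uses the standard genetic code only and makes no attempt to handle ambiguous letters (an unknown codon raises KeyError).

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


class SimpleSeq:
    """Hypothetical sketch: no alphabet needed to transcribe or translate."""

    def __init__(self, data):
        self.data = data.upper()

    def transcribe(self):
        return SimpleSeq(self.data.replace("T", "U"))

    def translate(self):
        dna = self.data.replace("U", "T")   # accept DNA or RNA input
        usable = len(dna) - len(dna) % 3    # ignore a trailing partial codon
        protein = "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))
        return SimpleSeq(protein)


print(SimpleSeq("ATGGCCTAA").transcribe().data)   # AUGGCCUAA
print(SimpleSeq("ATGGCCTAA").translate().data)    # MA*
```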

--Michiel.

-- 
Michiel de Hoon, Assistant Professor
University of Tokyo, Institute of Medical Science
Human Genome Center
4-6-1 Shirokane-dai, Minato-ku
Tokyo 108-8639
Japan
http://bonsai.ims.u-tokyo.ac.jp/~mdehoon