[Biopython-dev] Rethinking Seq objects

Wed Apr 27 08:37:03 EDT 2005

On Wed, 27 Apr 2005, Michiel Jan Laurens de Hoon wrote:

> 1) Make Seq objects mutable, and get rid of MutableSeq.

I imagine it will be a lot slower to replace built-in strings with
character arrays. Right now, I only use Seq when I absolutely have to.

Personally, I'd love it if Seq were just a lightweight subclass of
str without the performance penalties of the existing Seq. Using a
surrogate pattern slows down all those inner loops a lot, and so does
lots of unnecessary input checking. I think performance should be a
concern when you are talking about what should be the most-used part
of the library.

Similarly, I think lots of magic trying to figure out the alphabet is
a bad idea. There are only a few operations that actually require the
alphabet to be known, and most of the time I store a sequence in
memory I'm not going to need any of them, so having to deal with
alphabet issues when it's unnecessary is just going to be a pain in
the butt that will keep me from using Seq. I also use augmented
alphabets with things like B in them, and I don't want Seq yelling at
me when there's no point. Sure, complain if it can't figure out how to
revcom the sequence, but just to instantiate it?
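
One way to read this is "check the symbols only in the operation that
needs them". The sketch below is an assumption for illustration, not
an existing Biopython function: it handles unambiguous DNA only and
raises an error at reverse-complement time rather than when the
string is created.

```python
# Illustrative only: covers the four unambiguous DNA letters in both cases.
_DNA_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")
_DNA_LETTERS = frozenset("ACGTacgt")


def reverse_complement(seq):
    """Reverse-complement a DNA string, failing only if a symbol is unknown."""
    unknown = set(seq) - _DNA_LETTERS
    if unknown:
        raise ValueError(
            "cannot complement symbols: " + "".join(sorted(unknown))
        )
    return seq.translate(_DNA_COMPLEMENT)[::-1]


odd = "GATB"  # storing an unusual symbol costs nothing and raises nothing
```

Here a sequence containing B can be stored, sliced and searched
freely; only reverse_complement(odd) would raise ValueError.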

I think these principles from the Zen of Python would be
well-considered here:

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Sparse is better than dense.
Readability counts.
In the face of ambiguity, refuse the temptation to guess.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.

> Right now, we can do
> >>> from Bio.Seq import *
> >>> from Bio.Alphabet import IUPAC
> >>> my_alpha = IUPAC.unambiguous_dna
> >>> my_seq = MutableSeq('GATCGATGGGCCTATTAGGATCGAAAATCGC', my_alpha)
> >>> my_seq[:10] = "weirdstuff"
> >>> my_seq
> MutableSeq(array('c', 'weirdstuffCCTATTAGGATCGAAAATCGC'),
> IUPACUnambiguousDNA())

"Doctor, it hurts when I do this."
"Don't do that."

> 4) Make Seq objects understand circular genomes. Many bacterial genomes are
> circular. It would be nice if we could take the indices [-1000:1000] from a
> Seq object, if it is circular, or [3999000:4001000] if the sequence is
> circular with length 4000000.

I'm sure that will be useful to some people. But putting it in a
CircularSeq subclass would keep this extra functionality from
affecting the primary use case.
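
A rough sketch of what such a subclass could look like. The name
CircularSeq comes from the paragraph above, but the behaviour shown
here (wrap every index modulo the length, support only step 1) is an
assumption made for illustration:

```python
class CircularSeq(str):
    """Hypothetical str subclass whose indices wrap around the sequence end."""

    def __getitem__(self, index):
        n = len(self)
        if n == 0:
            return str.__getitem__(self, index)
        if isinstance(index, slice):
            if index.step not in (None, 1):
                raise ValueError("this sketch only supports a step of 1")
            start = 0 if index.start is None else index.start
            stop = n if index.stop is None else index.stop
            # Walk from start to stop, wrapping each position modulo n.
            return "".join(
                str.__getitem__(self, i % n) for i in range(start, stop)
            )
        return str.__getitem__(self, index % n)


genome = CircularSeq("ABCDEFGH")
across_origin = genome[-2:2]  # "GHAB"
past_the_end = genome[7:10]   # "HAB"
```

Note that wrapping changes what a negative bound means in a slice:
genome[:-1] is empty here, whereas a plain str would return everything
but the last character. That difference is one more reason to keep the
behaviour out of the base class.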

> 5) Perhaps it would be a good idea to add transcribe and translate methods to 
> the Seq class.

+1

You would obviously have to specify an alphabet for this, but I'm fine
with that so long as I'm not forced to when I don't need to.
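
As an illustration of "specify it only when you need it", here is a
sketch written as plain functions. The function names, the explicit
codon_table argument and the four-entry table are all assumptions for
the example; the table is a tiny excerpt of the standard genetic code,
not a complete one:

```python
def transcribe(dna):
    """Return the RNA version of a DNA coding strand (T becomes U)."""
    return dna.replace("T", "U").replace("t", "u")


def translate(rna, codon_table):
    """Translate RNA using a caller-supplied codon table.

    Trailing bases that do not fill a whole codon are ignored.
    """
    protein = []
    for i in range(0, len(rna) - len(rna) % 3, 3):
        codon = rna[i:i + 3]
        try:
            protein.append(codon_table[codon])
        except KeyError:
            raise ValueError(
                f"codon {codon!r} is not in the supplied table"
            ) from None
    return "".join(protein)


# Excerpt of the standard genetic code, for demonstration only.
TINY_TABLE = {"AUG": "M", "GCC": "A", "UUU": "F", "UAA": "*"}

rna = transcribe("ATGGCCTTTTAA")
protein = translate(rna, TINY_TABLE)
```

The table is needed only at the call to translate(); creating or
transcribing the sequence never asks for it.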
-- 
Michael Hoffman <hoffman at ebi.ac.uk>
European Bioinformatics Institute