[Biopython-dev] [Bug 2351] Make Seq more like a string, even subclass string?

Wed Oct 31 09:54:24 UTC 2007

http://bugzilla.open-bio.org/show_bug.cgi?id=2351

------- Comment #7 from biopython-bugzilla at maubp.freeserve.co.uk  2007-10-31 05:54 EST -------
> In short, to my mind a Seq object should have the following properties:
> 1) A Seq object is basically a string, so it should behave as if it were
> subclassed from string.

I agree, where possible the Seq object should act like a string.
In particular str(my_seq) should give the full string.

> 2) As a result, functions that have a sequence as an argument, but don't
> need the added features of a Seq object, should work with strings as well
> as Seq objects.

Again, I agree.  I've double-checked that this works for some of the recently
updated SeqUtils functionality.  I would hope we get this "for free" once the
Seq object itself becomes more string-like.
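
As a rough illustration of points 1 and 2, here is a minimal sketch of a
function that accepts either a plain string or a Seq-like object by going
through str().  SimpleSeq and gc_fraction are hypothetical names made up for
this example; they are not the Bio.Seq or SeqUtils code.

```python
class SimpleSeq:
    """Hypothetical stand-in for a Seq object (not Biopython's class)."""

    def __init__(self, data):
        self._data = data

    def __str__(self):
        # str(my_seq) gives the full sequence, as argued above.
        return self._data


def gc_fraction(sequence):
    """Return the G+C fraction of a plain string or any object with __str__."""
    text = str(sequence).upper()
    if not text:
        return 0.0
    return (text.count("G") + text.count("C")) / len(text)


from_string = gc_fraction("GGCC")
from_object = gc_fraction(SimpleSeq("atgc"))
```

The function never needs to know which of the two types it was given.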

> 3) The sequence should be mutable, so that we won't need a separate
> MutableSeq class. This also implies that a Seq class cannot subclass from
> string, since strings are not mutable.

Why? Python strings are not mutable, and this isn't usually a problem.
Personally, I have never needed a mutable sequence and have only ever used them
in test cases.  Having the basic Seq non-mutable means we can leverage existing
string functionality and optimizations.

Also, writing a new mutable sequence in C seems like a bit of a maintenance
load in the long term (and may complicate the cross-platform build process).
Surely we can get good enough performance via the array of characters route
currently used?

On a related note: the fact that the current MutableSeq methods like
reverse_complement() work in situ rather than returning a new object makes
switching between the Seq and MutableSeq fiddly.
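
To make the difference concrete, here is a minimal sketch of the two calling
conventions.  Both classes are hypothetical stand-ins written for this
example, not the actual Bio.Seq code, and they only handle unambiguous DNA
letters.

```python
# Translation table for unambiguous DNA letters only (an assumption of this sketch).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


class ImmutableSeqSketch:
    """Hypothetical immutable sequence: methods return new objects."""

    def __init__(self, data):
        self._data = str(data)

    def __str__(self):
        return self._data

    def reverse_complement(self):
        return ImmutableSeqSketch(self._data.translate(_COMPLEMENT)[::-1])


class MutableSeqSketch:
    """Hypothetical mutable sequence: methods change the object in place."""

    def __init__(self, data):
        self._data = list(data)

    def __str__(self):
        return "".join(self._data)

    def reverse_complement(self):
        # Modifies self and returns None, like list.reverse().
        self._data[:] = "".join(self._data).translate(_COMPLEMENT)[::-1]


fixed = ImmutableSeqSketch("AACG")
fixed_rc = fixed.reverse_complement()      # new object, original untouched

editable = MutableSeqSketch("AACG")
editable_rc = editable.reverse_complement()  # None; editable itself has changed
```

Code written as x = x.reverse_complement() works for the first class but
silently sets x to None for the second, which is the fiddly part.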

> 4) Currently, Seq objects have an associated alphabet; SeqRecord objects
> [also] have annotations, dbxrefs, a description, features, id, and name.
> I think a new Seq object should have both, so that we can avoid having both
> a Seq and a SeqRecord class. Of course, some or all of these fields can
> remain None.

I don't really see the benefit over the current scheme.  I'm happy with the
division between Seq and SeqRecord, but we could go for SeqRecord being a more
annotated subclass of the Seq class.  This would be similar to Bioperl's Seq,
PrimarySeq, or RichSeq objects.

Something I do want to add is slicing for SeqRecords, which would return a new
SeqRecord with sensible name/id/description.  I think for this to really be
useful we need to add "per residue annotation", such as lists or strings of
information the same length as the sequence (e.g. predicted secondary
structure, or sequencing quality scores) which would also get sliced when
slicing a SeqRecord.
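
A minimal sketch of what I have in mind, assuming the per residue annotation
is held in a dictionary of lists or strings.  RecordSketch and its per_residue
attribute are hypothetical names for this example, not the current SeqRecord
API, and the new record here simply inherits the parent's id, name and
description.

```python
class RecordSketch:
    """Hypothetical annotated record with per residue annotation."""

    def __init__(self, seq, id="", name="", description="", per_residue=None):
        self.seq = seq
        self.id = id
        self.name = name
        self.description = description
        self.per_residue = dict(per_residue or {})
        # Every per residue annotation must match the sequence length.
        for key, values in self.per_residue.items():
            if len(values) != len(seq):
                raise ValueError(
                    "annotation %r has length %d, expected %d"
                    % (key, len(values), len(seq))
                )

    def __getitem__(self, index):
        if not isinstance(index, slice):
            # A single position just returns that element of the sequence.
            return self.seq[index]
        # Slice the sequence and each per residue annotation the same way.
        sliced = {key: values[index] for key, values in self.per_residue.items()}
        return RecordSketch(
            self.seq[index],
            id=self.id,
            name=self.name,
            description=self.description,
            per_residue=sliced,
        )


record = RecordSketch(
    "ACGTAC",
    id="x1",
    per_residue={"quality": [10, 20, 30, 40, 50, 60], "structure": "HHHEEE"},
)
sub = record[1:4]
```

Here sub.seq is "CGT" and the quality list and structure string are cut to the
same three positions.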

> 5) A Seq class should have methods that one expects from a sequence class,
> in particular complement(), reverse_complement(), perhaps a modified count()
> that can ignore case.

Usually mixed case sequences are used for a reason, and the user may need both
case sensitive counts and case insensitive counts.  I would keep .count() case
sensitive like a real string, and suggest .upper().count() as a simple
workaround for case insensitive counts.
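
For example, with a plain Python string (the suggestion being that a Seq
object would behave the same way):

```python
# Plain strings are used here; the example sequence is made up.
mixed = "ACGTacgtNNaa"

lower_only = mixed.count("a")            # case sensitive: lower case "a" only
either_case = mixed.upper().count("A")   # case insensitive via upper()
```

This gives 3 for the case sensitive count and 4 for the case insensitive one.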

Plus the Seq object should have methods for forward and back transcription and
translation; see Bug 2381.

A more drastic change we could consider is getting rid of the alphabet as an
explicit property, and having ProteinSeq, NucleotideSeq, DnaSeq and RnaSeq
(decorator/sub)classes which would have only the relevant biological sequence
methods.  We would lose the expected "letters" feature of the alphabet, but I
don't think this is really helpful at the moment because the Seq class does not
enforce it.
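
A very rough sketch of the subclass idea, here built directly on str to show
what we would get for free.  All class names are hypothetical, and only
unambiguous DNA letters are handled.

```python
class BaseSeqSketch(str):
    """Hypothetical immutable sequence that subclasses str directly."""


class ProteinSeqSketch(BaseSeqSketch):
    """Protein sequences get no nucleotide-only methods."""


class DnaSeqSketch(BaseSeqSketch):
    """DNA sequences add complement() and reverse_complement()."""

    # Unambiguous letters only (an assumption of this sketch).
    _table = str.maketrans("ACGTacgt", "TGCAtgca")

    def complement(self):
        return DnaSeqSketch(self.translate(self._table))

    def reverse_complement(self):
        return DnaSeqSketch(self.translate(self._table)[::-1])


dna = DnaSeqSketch("AACG")
protein = ProteinSeqSketch("MKV")

rc = dna.reverse_complement()   # still a DnaSeqSketch
a_count = dna.count("A")        # inherited string method
piece = dna[1:3]                # note: inherited slicing returns a plain str
```

The inherited string methods work immediately, but anything that should
return a sequence object (slicing, upper(), and so on) would have to be
overridden, since the inherited versions return plain strings.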

Otherwise I would advocate that when creating a Seq object (or editing a
MutableSeq object) the new letters should be screened against
self.alphabet.letters (if present).
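
Something along these lines, as a minimal sketch.  The classes below are
hypothetical stand-ins for this example; the only thing taken from the current
scheme is the idea of an alphabet object with an optional letters attribute.

```python
class GenericAlphabetSketch:
    """Hypothetical alphabet with no defined letters (nothing to enforce)."""

    letters = None


class DnaAlphabetSketch(GenericAlphabetSketch):
    """Hypothetical unambiguous DNA alphabet."""

    letters = "GATC"


class CheckedSeq:
    """Hypothetical sequence that screens its letters on creation."""

    def __init__(self, data, alphabet):
        letters = getattr(alphabet, "letters", None)
        if letters is not None:
            # Only screen when the alphabet actually defines its letters.
            unexpected = set(data) - set(letters)
            if unexpected:
                raise ValueError(
                    "letters not in alphabet: %s" % "".join(sorted(unexpected))
                )
        self.data = data
        self.alphabet = alphabet


ok = CheckedSeq("GATTACA", DnaAlphabetSketch())
anything = CheckedSeq("GATXACA", GenericAlphabetSketch())  # no letters, no check

try:
    CheckedSeq("GATXACA", DnaAlphabetSketch())
    rejected = False
except ValueError:
    rejected = True
```

Note the check as written is case sensitive, so a lower case sequence would
be rejected by an upper case alphabet.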

On balance I favour making gradual changes which keep the current scheme
(Seq with Alphabet property; SeqRecord with Seq property).  Anything more
drastic might best be pursued on a new branch which could become
Biopython 2.0.

P.S. We should try not to implicitly assume that the elements in a sequence are
single letters.  What about when working with protein structures which contain
modified amino acids (with defined three letter codes) which do not map back to
single letters?
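
As a small illustration, using a plain tuple of three letter residue codes
(MSE for selenomethionine and SEP for phosphoserine; the sequence itself is
made up):

```python
# Each element is one residue, not one character.
residues = ("ALA", "MSE", "GLY", "SEP", "ALA")

fragment = residues[1:4]             # slicing by residue still works
ala_count = residues.count("ALA")    # counting by residue still works

# Joining into a single string loses the element boundaries:
flattened = "".join(residues)
```

The tuple has 5 elements but the joined string has 15 characters, so any code
assuming one character per residue would give the wrong length and positions.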
