[Biopython-dev] Sequence object allows non-alphabet characters

Sun Dec 18 13:50:07 UTC 2011

Dear Biopython developers,

I wonder why the following code does not throw an exception:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> mySeq = Seq("GATC1234YWSK", IUPAC.unambiguous_dna)
>>> mySeq
Seq('GATC1234YWSK', IUPACUnambiguousDNA())

I expected that creating a sequence object containing non-alphabet characters
would either raise an exception or warning, or "downgrade" the alphabet if
possible.
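
To illustrate the "downgrade" behaviour I had in mind, here is a minimal sketch
in plain Python. The helper name smallest_alphabet and the hard-coded letter
strings are assumptions made for illustration only, not existing Biopython API;
the strings are meant to mirror the letters of the unambiguous and ambiguous
IUPAC DNA alphabets.

```python
# Hypothetical helper, not part of Biopython.
# Assumed letter sets for the two IUPAC DNA alphabets.
UNAMBIGUOUS_DNA = "GATC"
AMBIGUOUS_DNA = "GATCRYWSMKHBVDN"


def smallest_alphabet(data, candidates=(UNAMBIGUOUS_DNA, AMBIGUOUS_DNA)):
    """Return the first candidate letter set that covers every character.

    Candidates are tried in order, so list them from strictest to loosest.
    The comparison is case-sensitive. Raises ValueError if nothing fits.
    """
    found = set(data)
    for letters in candidates:
        if found <= set(letters):
            return letters
    # Report the characters that no candidate alphabet contains.
    bad = sorted(found - set().union(*candidates))
    raise ValueError("no candidate alphabet fits; unexpected characters: %r"
                     % (bad,))
```

With this sketch, "GATC" stays unambiguous DNA, "GATCYWSK" would be downgraded
to ambiguous DNA, and "GATC1234YWSK" raises a ValueError.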

Another facet of the same problem is whitespace:

>>> mySeq = Seq("GATC GATC", IUPAC.unambiguous_dna)
>>> mySeq
Seq('GATC GATC', IUPACUnambiguousDNA())
>>> len(mySeq)
9

This is problematic whenever the sequence length is needed (for example, when
calculating GC content or melting temperature).
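
As a small worked example of the effect, the sketch below computes a GC
fraction by dividing by the string length. The function naive_gc_fraction is a
hypothetical helper written for this illustration, not a Biopython function.

```python
def naive_gc_fraction(data):
    """Fraction of G and C characters relative to the full string length."""
    return (data.count("G") + data.count("C")) / float(len(data))


clean = "GATCGATC"
spaced = "GATC GATC"  # same bases, with one embedded space

print(naive_gc_fraction(clean))   # 0.5 (4 of 8 characters)
print(naive_gc_fraction(spaced))  # about 0.444 (4 of 9 characters)
```

The single space is counted in the length, so the same eight bases give a
lower GC fraction.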

While it could be argued that checking the integrity of sequence data belongs
to parsing, I think the sequence in a sequence object should never contain
whitespace, and if an alphabet is assigned, the sequence should not contain
non-alphabet characters. Shouldn't this be handled by the sequence object
itself?
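
A rough sketch of what such a check at construction time could look like is
given below. CheckedSeq is a hypothetical stand-in class written in plain
Python, and its letters argument (a plain string of allowed characters) is an
assumption; it is not how Bio.Seq.Seq is implemented.

```python
class CheckedSeq(object):
    """Hypothetical sequence holder that validates its data on creation."""

    def __init__(self, data, letters=None):
        # Whitespace is rejected regardless of the alphabet.
        if any(ch.isspace() for ch in data):
            raise ValueError("sequence contains whitespace")
        # If allowed letters are given, every character must be among them.
        if letters is not None:
            bad = sorted(set(data) - set(letters))
            if bad:
                raise ValueError("characters not in alphabet: %s"
                                 % "".join(bad))
        self.data = data
        self.letters = letters

    def __len__(self):
        return len(self.data)
```

Here CheckedSeq("GATC", "GATC") is accepted, while both
CheckedSeq("GATC GATC", "GATC") and CheckedSeq("GATC1234YWSK", "GATC") raise a
ValueError. Without a letters argument only the whitespace check applies.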