[Biopython-dev] [Bug 2550] Alphabet problems when adding sequences

Sun Jul 27 19:06:22 UTC 2008

http://bugzilla.open-bio.org/show_bug.cgi?id=2550

------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk  2008-07-27 15:06 EST -------
With the patch applied, repeating the example from my comment 0:

>>> from Bio import Alphabet
>>> from Bio.Alphabet import IUPAC
>>> from Bio.Seq import Seq
>>> a = Seq("ACTG", Alphabet.generic_dna)
>>> b = Seq("AC-TG", Alphabet.Gapped(Alphabet.generic_dna, "-"))
>>> c = Seq("AC-TG", Alphabet.Gapped(IUPAC.unambiguous_dna, "-"))
>>> a
Seq('ACTG', DNAAlphabet())
>>> b
Seq('AC-TG', Gapped(DNAAlphabet(), '-'))
>>> c
Seq('AC-TG', Gapped(IUPACUnambiguousDNA(), '-'))
>>> b+c
Seq('AC-TGAC-TG', Gapped(DNAAlphabet(), '-'))
>>> a+b
Seq('ACTGAC-TG', Gapped(DNAAlphabet(), '-'))
>>> a+c
Seq('ACTGAC-TG', Gapped(DNAAlphabet(), '-'))

That is, all of the above additions now work.

>>> p = Seq("ACDEFG", Alphabet.generic_protein)
>>> q = Seq("ACDEFG", IUPAC.protein)
>>> r = Seq("ACDEFG*", Alphabet.HasStopCodon(IUPAC.protein, "*"))
>>> p
Seq('ACDEFG', ProteinAlphabet())
>>> q
Seq('ACDEFG', IUPACProtein())
>>> r
Seq('ACDEFG*', HasStopCodon(IUPACProtein(), '*'))
>>> p+q
Seq('ACDEFGACDEFG', ProteinAlphabet())
>>> p+r
Seq('ACDEFGACDEFG*', HasStopCodon(ProteinAlphabet(), '*'))

These work too.
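
The results above look like a fall-back to the most specific class that all
the operands share. A minimal sketch of that idea follows, using toy classes
rather than the real ones. The names ToyAlphabet, ToyProtein, ToyIUPACProtein,
ToyDNA and toy_consensus_base are hypothetical, and this is not the code in
the patch.

```python
# Toy stand-ins for the alphabet classes; every name here is hypothetical.
class ToyAlphabet:
    pass


class ToyProtein(ToyAlphabet):
    pass


class ToyIUPACProtein(ToyProtein):
    pass


class ToyDNA(ToyAlphabet):
    pass


def toy_consensus_base(alphabets):
    """Return an instance of the most specific class shared by all alphabets.

    Assumes a non-empty list and single inheritance between the toy classes.
    """
    alphabets = list(alphabets)
    # The MRO of the first alphabet runs from most specific to most generic,
    # so the first class that every alphabet is an instance of is the answer.
    for klass in type(alphabets[0]).__mro__:
        if klass is not object and all(isinstance(a, klass) for a in alphabets):
            return klass()
    raise TypeError("No common alphabet class")


# Generic protein plus IUPAC protein falls back to the generic protein class.
print(type(toy_consensus_base([ToyProtein(), ToyIUPACProtein()])).__name__)
```

In this toy model, mixing ToyIUPACProtein with ToyDNA would fall all the way
back to ToyAlphabet; whether the real code should allow that or raise is a
separate design question.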

>>> c = Seq("AC-TG", Alphabet.Gapped(IUPAC.unambiguous_dna, "-"))
>>> d = Seq('AC.TG', Alphabet.Gapped(IUPAC.unambiguous_dna, '.'))
>>> c
Seq('AC-TG', Gapped(IUPACUnambiguousDNA(), '-'))
>>> d
Seq('AC.TG', Gapped(IUPACUnambiguousDNA(), '.'))
>>> c+d
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "Bio/Seq.py", line 78, in __add__
    a = Alphabet._consensus_alphabet([self.alphabet, other.alphabet])
  File "/home/maubp/repository/biopython/Bio/Alphabet/__init__.py", line 199, in _consensus_alphabet
    raise ValueError("More than one gap character present")
ValueError: More than one gap character present

The error message has changed (it is now more explicit), but I think this is a
genuine failure case: the two sequences use different gap characters.

Then, based on the example in my comment 1:

>>> p = Seq("PKL-PAK", Alphabet.Gapped(Alphabet.generic_protein,"-"))
>>> q = Seq("ADKS*", Alphabet.HasStopCodon(Alphabet.generic_protein,"*"))
>>> p+q
Seq('PKL-PAKADKS*', HasStopCodon(Gapped(ProteinAlphabet(), '-'), '*'))

This works now too.

One final example of a valid failure:

>>> q = Seq("ADKS*", Alphabet.HasStopCodon(Alphabet.generic_protein,"*"))
>>> r = Seq("SRFG@", Alphabet.HasStopCodon(Alphabet.generic_protein,"@"))
>>> q+r
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "Bio/Seq.py", line 78, in __add__
    a = Alphabet._consensus_alphabet([self.alphabet, other.alphabet])
  File "/home/maubp/repository/biopython/Bio/Alphabet/__init__.py", line 208, in _consensus_alphabet
    raise ValueError("More than one stop symbol present")
ValueError: More than one stop symbol present
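
The two valid failures above suggest a simple rule: merging is fine as long as
at most one distinct gap character and at most one distinct stop symbol are
seen across the operands. A rough sketch under that assumption follows, with
each alphabet reduced to a (gap_char, stop_symbol) tuple where None means
"not set". The name toy_consensus_symbols is hypothetical; this is not the
private function in Bio.Alphabet.

```python
def toy_consensus_symbols(alphabets):
    """Merge (gap_char, stop_symbol) pairs, rejecting conflicting symbols.

    Each pair uses None for "this alphabet has no gap/stop symbol".
    """
    gap = None
    stop = None
    for alpha_gap, alpha_stop in alphabets:
        if alpha_gap is not None:
            if gap is None:
                gap = alpha_gap
            elif gap != alpha_gap:
                raise ValueError("More than one gap character present")
        if alpha_stop is not None:
            if stop is None:
                stop = alpha_stop
            elif stop != alpha_stop:
                raise ValueError("More than one stop symbol present")
    return gap, stop


# A gapped alphabet plus one with a stop codon: both symbols are kept.
print(toy_consensus_symbols([("-", None), (None, "*")]))
```

Passing [("-", None), (".", None)] or [(None, "*"), (None, "@")] to this
sketch raises ValueError, mirroring the two tracebacks.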

I'd be grateful if anyone could test this, or comment on the code.  While
adding private functions to Bio.Alphabet is a reasonable short-term solution
(and means we can change arguments and names without breaking people's
scripts!), some of this functionality might be best exposed publicly.
