[Biopython-dev] Seq object join method

Fri Nov 20 16:11:43 UTC 2009

Hello all,

Some more code to evaluate, again on a branch in github:
http://github.com/peterjc/biopython/commit/c7cd0329061f88e3a8eae0979dd17c54a36ab4e5

This adds a join method to the Seq object, basically an
alphabet-aware version of the Python string join method. Recall
that for strings:

sep.join([a,b,c]) == a + sep + b + sep + c
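
A quick sanity check of that identity with plain strings (the values
are arbitrary placeholders chosen for illustration):

```python
a, b, c = "ACG", "TTA", "GGC"
sep = "-"

# join puts the separator between consecutive items only,
# never before the first or after the last
joined = sep.join([a, b, c])
assert joined == a + sep + b + sep + c

# a single-item list therefore gets no separator at all
single = sep.join([a])
assert single == a
```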

This leads to a common idiom for concatenating a list of strings:

"".join([a,b,c]) == a + "" + b + "" + c == a + b + c

That is fine for strings, but not necessarily for Seq objects, since
even a zero-length sequence has an alphabet. Consider this example:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet.IUPAC import unambiguous_dna, ambiguous_dna
>>> unamb_dna_seq = Seq("ACGT", unambiguous_dna)
>>> ambig_dna_seq = Seq("ACRGT", ambiguous_dna)
>>> unamb_dna_seq
Seq('ACGT', IUPACUnambiguousDNA())
>>> ambig_dna_seq
Seq('ACRGT', IUPACAmbiguousDNA())

If we add the ambiguous and unambiguous IUPAC DNA alphabets,
we get the ambiguous IUPAC DNA alphabet:

>>> unamb_dna_seq + ambig_dna_seq
Seq('ACGTACRGT', IUPACAmbiguousDNA())

However, if a sequence with the default generic alphabet is included
(here an empty Seq), the result has the generic alphabet:

>>> unamb_dna_seq + Seq("") + ambig_dna_seq
Seq('ACGTACRGT', Alphabet())

Now consider Seq("").join([unamb_dna_seq, ambig_dna_seq]). Should
it follow the addition behaviour (giving the default generic alphabet),
or "do the sensible thing" and preserve the IUPAC alphabet?

As written, Seq("").join(...) is handled as a special case, and
the alphabet of the empty sequence is ignored. To me this is a
case of "practicality beats purity": it is much nicer than being
forced to write Seq("", ambiguous_dna).join(...), where the empty
sequence is given a suitable alphabet.
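
To make the special case concrete, here is a rough toy model in plain
Python. It is not the code on the branch: ToySeq, combine and the
string alphabet names are hypothetical stand-ins, and combine only
mimics the alphabet results shown in the examples above. The exact
condition for skipping the separator (empty data plus the generic
alphabet) is likewise an assumption made for this sketch.

```python
# Toy model only: hypothetical stand-ins, not the Biopython classes.
GENERIC = "generic"
UNAMBIGUOUS_DNA = "unambiguous_dna"
AMBIGUOUS_DNA = "ambiguous_dna"


def combine(a, b):
    """Alphabet of a + b: equal stays, the DNA pair widens, else generic."""
    if a == b:
        return a
    if {a, b} == {UNAMBIGUOUS_DNA, AMBIGUOUS_DNA}:
        return AMBIGUOUS_DNA
    return GENERIC


class ToySeq:
    def __init__(self, data, alphabet=GENERIC):
        self.data = data
        self.alphabet = alphabet

    def __add__(self, other):
        return ToySeq(self.data + other.data,
                      combine(self.alphabet, other.alphabet))

    def join(self, parts):
        parts = list(parts)
        if not parts:
            return ToySeq("", self.alphabet)
        # Assumed special case: an empty separator with the generic
        # alphabet is left out, so it cannot degrade the result.
        skip_sep = self.data == "" and self.alphabet == GENERIC
        result = parts[0]
        for part in parts[1:]:
            if not skip_sep:
                result = result + self
            result = result + part
        return result


unamb = ToySeq("ACGT", UNAMBIGUOUS_DNA)
ambig = ToySeq("ACRGT", AMBIGUOUS_DNA)

# Plain addition through an empty generic sequence loses the alphabet...
via_add = unamb + ToySeq("") + ambig
assert via_add.alphabet == GENERIC

# ...while the special-cased join keeps the ambiguous DNA alphabet.
via_join = ToySeq("").join([unamb, ambig])
assert via_join.alphabet == AMBIGUOUS_DNA
```

In this sketch a non-empty separator, or an empty one with an explicit
alphabet, still takes part in the alphabet combination exactly as
addition would.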

So, what do people think?

Peter