[Biopython-dev] Determining if seq alphabet is protein/dna/rna

Mon Oct 30 00:13:57 UTC 2006

Hello all,

I've been looking at writing multiple sequence alignments in Nexus 
format for the new Bio.SeqIO code, and came up with the following little 
problem:

Given one or more Seq objects, how can I reliably decide if they are 
protein, DNA, or RNA?

(These are the relevant choices in a Nexus file's format datatype=... 
header.)

I'm resigned to the fact that if the Seq object has the generic alphabet 
this boils down to looking at the sequence strings and making an 
educated guess (probably following an established algorithm from an 
alignment program).  Does any such code already exist in Biopython?
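
To make the "educated guess" concrete, here is a rough sketch of the 
kind of thing I mean.  It is not taken from any alignment program; the 
function name, the "ACGTUN" letter set and the 0.9 threshold are all 
assumptions picked purely for illustration:

```python
def guess_sequence_type(sequences, threshold=0.9):
    """Guess 'protein', 'dna' or 'rna' from raw sequence strings.

    Hypothetical helper: the letter set and threshold are arbitrary
    assumptions, not an established algorithm.
    """
    # Pool all letters, skipping gap characters such as '-' and '.'.
    letters = [c for s in sequences for c in str(s).upper() if c.isalpha()]
    if not letters:
        raise ValueError("no letters to inspect")

    # Fraction of letters that could plausibly be nucleotides.
    nucleotide_like = sum(1 for c in letters if c in "ACGTUN")
    if float(nucleotide_like) / len(letters) < threshold:
        return "protein"

    # Nucleotide data: U without any T suggests RNA, otherwise assume DNA.
    if "U" in letters and "T" not in letters:
        return "rna"
    return "dna"
```

A short protein made up mostly of A, C, G and T would be misclassified, 
which is exactly why I would rather ask the alphabet when there is one.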

However - is there a nice/official way to ask an alphabet object what it 
is (protein, DNA, RNA)?

Looking over the code in Bio.Alphabet, the only thing I can think of is 
to get the class name as a string and search it(!)  We can't look at the 
letters attribute, as this is None for base classes like ProteinAlphabet.
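
For the record, the class name hack would look something like the sketch 
below.  The classes are minimal stand-ins named after those in 
Bio.Alphabet so that the example runs on its own, and the function name 
kind_from_class_name is made up:

```python
# Stand-ins for the Bio.Alphabet base classes (assumption: only the
# class names matter for this sketch).
class Alphabet(object):
    letters = None

class ProteinAlphabet(Alphabet):
    pass

class DNAAlphabet(Alphabet):
    pass

class RNAAlphabet(Alphabet):
    pass


def kind_from_class_name(alphabet):
    """Return 'protein', 'dna', 'rna' or None by searching the class name."""
    name = type(alphabet).__name__.upper()
    for kind in ("protein", "dna", "rna"):
        if kind.upper() in name:
            return kind
    # Nothing recognisable in the name, e.g. the generic alphabet.
    return None
```

This breaks as soon as someone writes a subclass whose name does not 
contain one of the three keywords, hence my reluctance.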

If we are prepared to meddle with the alphabet system we might add 
attributes like "isProtein", "isNucleotide", "isRNA", "isDNA" to these 
base classes.  Or simply have a "sequence_type" method, which the 
subclasses can re-define as required.
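
As a sketch of the "sequence_type" idea (again using stand-in classes 
rather than the real Bio.Alphabet code; the return values and the 
nexus_datatype helper are just my assumptions about how it might be 
used):

```python
class Alphabet(object):
    letters = None

    def sequence_type(self):
        # Generic alphabet: type unknown.
        return None

class ProteinAlphabet(Alphabet):
    def sequence_type(self):
        return "protein"

class NucleotideAlphabet(Alphabet):
    def sequence_type(self):
        return "nucleotide"

class DNAAlphabet(NucleotideAlphabet):
    def sequence_type(self):
        return "dna"

class RNAAlphabet(NucleotideAlphabet):
    def sequence_type(self):
        return "rna"


def nexus_datatype(alphabets):
    """Pick a single datatype for a set of alphabets (hypothetical helper)."""
    kinds = set(a.sequence_type() for a in alphabets)
    if len(kinds) == 1 and kinds <= set(["protein", "dna", "rna"]):
        return kinds.pop()
    raise ValueError("cannot determine a single datatype from %r" % kinds)
```

Any subclass would inherit the right answer from its base class without 
having to do anything.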

(I don't mean to reopen the whole "do we need alphabets" conversation, 
last discussed in July 2006.  At least, not yet...)

Peter