[Biojava-dev] Protein alphabet names

Thomas Down td2@sanger.ac.uk
Tue, 22 Oct 2002 10:34:15 +0100


Hi...

I've been working to tidy up the alphabet bootstrapping code
in AlphabetManager.java, initially with the aim of reducing
startup overhead of constructing a big DOM tree, but it's
turned into a bit more of a refactoring and rationalizing
exercise.

One thing I noted is that there's a rather significant inconsistency
in how symbols are named.  For nucleic acids, the name is the
actual chemical name of the base -- adenine, guanine, etc.  However,
for proteins we use the three-letter codes (ALA, GLN).  This dates
back to the days when symbols just had `long' and `short' forms,
and we decided that for proteins the most important representations
were 3-letter and 1-letter codes.  However, we now have separate
SymbolTokenization objects, which means that this is no longer
so much of an issue.  What I propose for 1.3 is to:

  - Make the name field the actual name of the amino acid
    (alanine, glutamine).

  - Add an additional tokenization (probably called "three-letter"
    unless someone comes up with a better suggestion) for people
    who actually want 3-letter codes.

I understand that this change might break a few programs -- this should
be pretty easy to correct for, though.
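
To make the proposal concrete, here is a minimal, self-contained sketch
of the idea: the symbol carries only its chemical name, and each textual
form lives in a separately named tokenization.  It does not use the real
BioJava classes; NamingSketch, Residue, tokenize() and the "one-letter"
key are hypothetical stand-ins invented for illustration.  Only the
"three-letter" name comes from the proposal above.

```java
import java.util.HashMap;
import java.util.Map;

public class NamingSketch {
    // Hypothetical stand-in for a symbol: it only knows its chemical name.
    static final class Residue {
        final String name;
        Residue(String name) { this.name = name; }
    }

    static final Residue ALA = new Residue("alanine");
    static final Residue GLN = new Residue("glutamine");

    // Hypothetical registry: tokenization name -> (symbol -> token).
    static final Map<String, Map<Residue, String>> TOKENIZATIONS = new HashMap<>();

    static {
        Map<Residue, String> threeLetter = new HashMap<>();
        threeLetter.put(ALA, "ALA");
        threeLetter.put(GLN, "GLN");
        TOKENIZATIONS.put("three-letter", threeLetter);

        // "one-letter" is an assumed key, chosen only for this sketch.
        Map<Residue, String> oneLetter = new HashMap<>();
        oneLetter.put(ALA, "A");
        oneLetter.put(GLN, "Q");
        TOKENIZATIONS.put("one-letter", oneLetter);
    }

    // Look up the token for a symbol in the named tokenization.
    static String tokenize(String tokenizationName, Residue residue) {
        Map<Residue, String> table = TOKENIZATIONS.get(tokenizationName);
        if (table == null || !table.containsKey(residue)) {
            throw new IllegalArgumentException(
                "No token for " + residue.name + " in " + tokenizationName);
        }
        return table.get(residue);
    }

    public static void main(String[] args) {
        // The name is the chemical name; codes come from tokenizations.
        System.out.println(GLN.name);
        System.out.println(tokenize("three-letter", GLN));
        System.out.println(tokenize("one-letter", GLN));
    }
}
```

Under this arrangement, code that currently relies on the name being
"ALA" or "GLN" would ask the three-letter tokenization instead, which is
the kind of small fix I have in mind.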

Does anyone have any objections to this?

    Thomas.

PS. I'm also adding the regularly-requested feature of providing
    a public method for parsing additional, user-supplied
    AlphabetManager.xml-format files.