[Biojava-dev] Protein alphabet names

Matthew Pocock matthew_pocock@yahoo.co.uk
Tue, 22 Oct 2002 11:50:56 +0100 (BST)

 --- Thomas Down <td2@sanger.ac.uk> wrote:
> Hi...
> I've been working to tidy up the alphabet bootstrapping code
> in AlphabetManager.java, initially with the aim of reducing
> startup overhead of constructing a big DOM tree, but it's
> turned into a bit more of a refactoring and rationalizing
> exercise.

This code is very old (archaeological?) and it's great
that you're going through it.

> One thing I noted is that there's a rather significant inconsistency
> in how symbols are named.  For nucleic acids, the name is the
> actual chemical name of the base -- adenine, guanine, etc.  However,
> for proteins we use three-letter code (ALA, GLN).  This dates
> back to the days when symbols just had `long' and `short' forms,
> and we decided that for proteins the most important representations
> were 3-letter and 1-letter codes.  However, we now have separate
> SymbolTokenization objects, which mean that this is no longer
> so much of an issue.  What I propose for 1.3 is to:
>   - Make the name field the actual name of the amino acid
>     (alanine, glutamine).
>   - Add an additional tokenization (probably called "three-letter"
>     unless someone comes up with a better suggestion) for people
>     who actually want 3-letter codes.
> I understand that this change might break a few programs -- this should
> be pretty easy to correct for, though.
> Does anyone have any objections to this?

I have no problems with this as long as apps that
could be using different tokenizations before/after
the change fail spectacularly and there is some
document telling you how to trivially fix the error.
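
To illustrate the kind of loud failure meant here, the following is a
minimal, self-contained Java sketch. It deliberately does not use the
real BioJava classes: AminoAcid, Tokenization, and the ONE_LETTER,
THREE_LETTER and BY_NAME tables are hypothetical stand-ins, and the
strict-lookup behaviour is an assumption about how such a change could
be made to fail fast, not a description of what AlphabetManager does.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NoSuchElementException;

public class TokenizationSketch {
    /** Hypothetical stand-in for a symbol; its name is the full chemical name. */
    static final class AminoAcid {
        final String name;
        AminoAcid(String name) { this.name = name; }
    }

    /** Hypothetical stand-in for a tokenization: a strict two-way token/symbol map. */
    static final class Tokenization {
        private final Map<String, AminoAcid> byToken = new HashMap<>();
        private final Map<AminoAcid, String> bySymbol = new HashMap<>();

        void register(String token, AminoAcid symbol) {
            byToken.put(token, symbol);
            bySymbol.put(symbol, token);
        }

        /** Throws instead of guessing, so a wrong tokenization is noticed at once. */
        AminoAcid parse(String token) {
            AminoAcid symbol = byToken.get(token);
            if (symbol == null) {
                throw new NoSuchElementException("Unknown token: " + token);
            }
            return symbol;
        }

        String tokenize(AminoAcid symbol) {
            String token = bySymbol.get(symbol);
            if (token == null) {
                throw new NoSuchElementException("No token for: " + symbol.name);
            }
            return token;
        }
    }

    static final Tokenization ONE_LETTER = new Tokenization();
    static final Tokenization THREE_LETTER = new Tokenization();
    static final Tokenization BY_NAME = new Tokenization();

    static {
        AminoAcid ala = new AminoAcid("alanine");
        AminoAcid gln = new AminoAcid("glutamine");
        ONE_LETTER.register("A", ala);
        ONE_LETTER.register("Q", gln);
        THREE_LETTER.register("ALA", ala);
        THREE_LETTER.register("GLN", gln);
        BY_NAME.register(ala.name, ala);
        BY_NAME.register(gln.name, gln);
    }

    public static void main(String[] args) {
        // After the change: 3-letter codes go through their own tokenization.
        AminoAcid ala = THREE_LETTER.parse("ALA");
        System.out.println(ala.name + " -> " + ONE_LETTER.tokenize(ala));

        // Old-style code that still treats "ALA" as the symbol name fails loudly.
        try {
            BY_NAME.parse("ALA");
        } catch (NoSuchElementException e) {
            System.out.println("Failed as intended: " + e.getMessage());
        }
    }
}
```

In this sketch the trivial fix for a broken program is to switch the
lookup from the name table to the three-letter table; nothing is
silently mapped to the wrong symbol in the meantime.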

