[Biojava-dev] Protein alphabet names
Matthew Pocock
matthew_pocock@yahoo.co.uk
Tue, 22 Oct 2002 11:50:56 +0100 (BST)
--- Thomas Down <td2@sanger.ac.uk> wrote:
> Hi...
>
> I've been working to tidy up the alphabet bootstrapping code
> in AlphabetManager.java, initially with the aim of reducing
> startup overhead of constructing a big DOM tree, but it's
> turned into a bit more of a refactoring and rationalizing
> exercise.
This code is very old (archaeological?) and it's great
that you're going through it.
>
> One thing I noted is that there's a rather significant inconsistency
> in how symbols are named. For nucleic acids, the name is the
> actual chemical name of the base -- adenine, guanine, etc. However,
> for proteins we use three-letter code (ALA, GLN). This dates
> back to the days when symbols just had `long' and `short' forms,
> and we decided that for proteins the most important representations
> were 3-letter and 1-letter codes. However, we now have separate
> SymbolTokenization objects, which mean that this is no longer
> so much of an issue. What I propose for 1.3 is to:
>
> - Make the name field the actual name of the amino acid
>   (alanine, glutamine).
>
> - Add an additional tokenization (probably called "three-letter"
>   unless someone comes up with a better suggestion) for people
>   who actually want 3-letter codes.
>
> I understand that this change might break a few programs -- this should
> be pretty easy to correct for, though.
>
> Does anyone have any objections to this?
>
I have no problems with this as long as apps that
could be using different tokenizations before/after
the change fail spectacularly, and there is some
document telling you how to trivially fix the error.
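
Roughly the behaviour I have in mind, as a self-contained sketch. The
class, method and tokenization names here are made up for illustration;
this is not the AlphabetManager or SymbolTokenization API, and the table
only holds three residues. The point is that an old-style name such as
"ALA" is rejected loudly, with a message that says how to fix it, rather
than being silently accepted.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for a protein alphabet; NOT the BioJava API. */
public class TokenizationSketch {
    // Full amino acid name -> three-letter code (a few residues only).
    private static final Map<String, String> THREE_LETTER = new LinkedHashMap<>();
    static {
        THREE_LETTER.put("alanine", "ALA");
        THREE_LETTER.put("glutamine", "GLN");
        THREE_LETTER.put("glycine", "GLY");
    }

    /**
     * Render a symbol, identified by its full name, under the given
     * tokenization. Anything unrecognised throws instead of guessing.
     */
    public static String tokenize(String tokenization, String symbolName) {
        if (!THREE_LETTER.containsKey(symbolName)) {
            // Old code passing "ALA" as a name ends up here.
            throw new IllegalArgumentException(
                "Unknown symbol name '" + symbolName + "'. Names are now full "
                + "amino acid names; use the \"three-letter\" tokenization "
                + "for codes like ALA.");
        }
        if ("name".equals(tokenization)) {
            return symbolName;
        }
        if ("three-letter".equals(tokenization)) {
            return THREE_LETTER.get(symbolName);
        }
        throw new IllegalArgumentException(
            "Unknown tokenization '" + tokenization + "'");
    }
}
```

With this, tokenize("three-letter", "alanine") gives "ALA", while
tokenize("name", "ALA") throws immediately.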
Matthew