[Biojava-dev] Protein alphabet names
Thomas Down
td2@sanger.ac.uk
Tue, 22 Oct 2002 10:34:15 +0100
Hi...
I've been working to tidy up the alphabet bootstrapping code
in AlphabetManager.java, initially with the aim of reducing
startup overhead of constructing a big DOM tree, but it's
turned into a bit more of a refactoring and rationalizing
exercise.
One thing I noted is that there's a rather significant inconsistency
in how symbols are named. For nucleic acids, the name is the
actual chemical name of the base -- adenine, guanine, etc. However,
for proteins we use three-letter codes (ALA, GLN). This dates
back to the days when symbols just had `long' and `short' forms,
and we decided that for proteins the most important representations
were 3-letter and 1-letter codes. However, we now have separate
SymbolTokenization objects, which means that this is no longer
so much of an issue. What I propose for 1.3 is to:
- Make the name field the actual name of the amino acid
(alanine, glutamine).
- Add an additional tokenization (probably called "three-letter"
unless someone comes up with a better suggestion) for people
who actually want 3-letter codes.
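
To make the split between names and tokenizations concrete, here is a
rough sketch in plain Java. It deliberately does not use the real BioJava
classes: ProteinNamingSketch, AminoAcid, Tokenization, and getTokenization
are made-up stand-ins for illustration, and the "one-letter" tokenization
name is an assumption ("three-letter" is the name suggested above).

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ProteinNamingSketch {

    // Hypothetical stand-in for a symbol: it knows only its full chemical name.
    static final class AminoAcid {
        private final String name;

        AminoAcid(String name) {
            this.name = name;
        }

        String getName() {
            return name;
        }
    }

    // Hypothetical stand-in for a tokenization: one textual encoding of the symbols.
    static final class Tokenization {
        private final Map<AminoAcid, String> tokenBySymbol = new LinkedHashMap<>();
        private final Map<String, AminoAcid> symbolByToken = new LinkedHashMap<>();

        void register(AminoAcid symbol, String token) {
            tokenBySymbol.put(symbol, token);
            symbolByToken.put(token, symbol);
        }

        String tokenize(AminoAcid symbol) {
            String token = tokenBySymbol.get(symbol);
            if (token == null) {
                throw new IllegalArgumentException("No token for " + symbol.getName());
            }
            return token;
        }

        AminoAcid parse(String token) {
            AminoAcid symbol = symbolByToken.get(token);
            if (symbol == null) {
                throw new IllegalArgumentException("Unknown token: " + token);
            }
            return symbol;
        }
    }

    // The name field holds the actual amino acid name, as proposed.
    static final AminoAcid ALANINE = new AminoAcid("alanine");
    static final AminoAcid GLUTAMINE = new AminoAcid("glutamine");

    // Tokenizations are looked up by name, independently of the symbol's name field.
    private static final Map<String, Tokenization> TOKENIZATIONS = new LinkedHashMap<>();

    static {
        Tokenization oneLetter = new Tokenization();
        oneLetter.register(ALANINE, "A");
        oneLetter.register(GLUTAMINE, "Q");
        TOKENIZATIONS.put("one-letter", oneLetter);

        Tokenization threeLetter = new Tokenization();
        threeLetter.register(ALANINE, "ALA");
        threeLetter.register(GLUTAMINE, "GLN");
        TOKENIZATIONS.put("three-letter", threeLetter);
    }

    static Tokenization getTokenization(String name) {
        Tokenization tokenization = TOKENIZATIONS.get(name);
        if (tokenization == null) {
            throw new IllegalArgumentException("No such tokenization: " + name);
        }
        return tokenization;
    }

    public static void main(String[] args) {
        Tokenization threeLetter = getTokenization("three-letter");
        for (AminoAcid aa : new AminoAcid[] {ALANINE, GLUTAMINE}) {
            // The symbol supplies the name; the tokenization supplies the code.
            System.out.println(aa.getName() + " -> " + threeLetter.tokenize(aa));
        }
    }
}
```

In this model, code that currently reads "ALA" from the name field would
ask the three-letter tokenization for it instead.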
I understand that this change might break a few programs -- this should
be pretty easy to correct for, though.
Does anyone have any objections to this?
Thomas.
PS. I'm also adding the regularly-requested feature of providing
a public method for parsing additional, user-supplied
AlphabetManager.xml-format files.