[Biojava-l] SymbolTokenization patch
Thomas Down
td2@sanger.ac.uk
Tue, 16 Oct 2001 14:36:00 +0100
Hi...
We've been seeing a number of problems recently relating
to the construction of alphabets, and in particular binding
names to ambiguous and gap symbols. I'd like to propose
a patch which addresses a number of issues in this area,
and should leave us with a more solid Symbol/Alphabet
infrastructure.
The idea is:
- Remove the `token' (single-character name) property from
the Symbol interface. This was problematic since it was
undefined in many cases (especially cross-product alphabets
where there might well be more symbols than ASCII characters).
- Replace the old SymbolParser (a one-way string -> symbol map)
with a new interface, SymbolTokenization (a two-way
string <--> symbol map).
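
To make the two-way map idea concrete, here is a minimal sketch. It is not the code in the patch: the names Sym, Tokenization, MapTokenization, parseToken and tokenizeSymbol are all made up for illustration, and the real SymbolTokenization interface may well look different. The point is only that the symbol carries no token of its own, and a separate object owns the mapping in both directions.

```java
import java.util.HashMap;
import java.util.Map;

public class TokenizationSketch {
    /** Stand-in for a symbol. Note that it has no token property. */
    static final class Sym {
        final String name; // internal/debugging name only
        Sym(String name) { this.name = name; }
    }

    /** Hypothetical two-way string <--> symbol map (not the BioJava interface). */
    interface Tokenization {
        Sym parseToken(String token);
        String tokenizeSymbol(Sym sym);
    }

    /** Simple table-backed implementation of the hypothetical interface. */
    static final class MapTokenization implements Tokenization {
        private final Map<String, Sym> tokenToSym = new HashMap<>();
        private final Map<Sym, String> symToToken = new HashMap<>();

        /** Registers one symbol/token pair in both directions. */
        void bind(Sym sym, String token) {
            tokenToSym.put(token, sym);
            symToToken.put(sym, token);
        }

        @Override
        public Sym parseToken(String token) {
            Sym sym = tokenToSym.get(token);
            if (sym == null) {
                throw new IllegalArgumentException("Unknown token: " + token);
            }
            return sym;
        }

        @Override
        public String tokenizeSymbol(Sym sym) {
            String token = symToToken.get(sym);
            if (token == null) {
                throw new IllegalArgumentException("No token bound for: " + sym.name);
            }
            return token;
        }
    }

    public static void main(String[] args) {
        Sym adenine = new Sym("adenine");
        MapTokenization dna = new MapTokenization();
        dna.bind(adenine, "a");
        // Round trip: symbol -> token -> the same symbol object.
        System.out.println(dna.parseToken(dna.tokenizeSymbol(adenine)) == adenine);
    }
}
```

A symbol with no binding simply fails to tokenize under that particular scheme, rather than needing a meaningless token of its own.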
In doing so, all sorts of cruft dies. In particular:
- Alphabet creation can be simplified.
- We can get back to the idealized situation whereby alphabets
IMPLICITLY contain all possible ambiguity symbols (including
the gap symbol), and no longer have to pre-seed ambiguity
symbols where they have tokens associated with them (e.g.
all DNA ambiguities, and a sub-set of protein ambiguities).
- We are able to handle multiple naming schemes (for instance,
arguments about what character to use for the gap symbol)
in a clean, transparent way. The underlying Symbols can remain
the same whichever naming scheme (==SymbolTokenization) you
use.
- All the conventions for naming of Cross-product symbols
are neatly packaged together in CrossProductTokenization.java.
It's easy to add alternative conventions in the future if
anyone needs to.
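
The naming-scheme point can be sketched in a few lines. Again, this is only an illustration under assumptions: the class and method names are hypothetical, and the "(a g)" style for cross-product names is an invented convention, not necessarily what CrossProductTokenization.java does. The same symbol objects are rendered under two schemes that differ only in the gap character.

```java
import java.util.Arrays;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class NamingSchemeSketch {
    // Stand-in symbols: plain objects compared by identity.
    static final Object A = new Object();
    static final Object G = new Object();
    static final Object GAP = new Object();

    /** Builds a symbol -> token table; only the gap token varies between schemes. */
    static Map<Object, String> scheme(String gapToken) {
        Map<Object, String> table = new IdentityHashMap<>();
        table.put(A, "a");
        table.put(G, "g");
        table.put(GAP, gapToken);
        return table;
    }

    /** Renders a symbol list under the given naming scheme. */
    static String render(List<Object> symbols, Map<Object, String> scheme) {
        StringBuilder out = new StringBuilder();
        for (Object sym : symbols) {
            out.append(scheme.get(sym));
        }
        return out.toString();
    }

    /** Invented convention: name a cross-product symbol from its components. */
    static String crossProductName(List<Object> components, Map<Object, String> scheme) {
        return components.stream()
                .map(scheme::get)
                .collect(Collectors.joining(" ", "(", ")"));
    }

    public static void main(String[] args) {
        List<Object> seq = Arrays.asList(A, GAP, G);
        System.out.println(render(seq, scheme("-")));  // gap shown as '-'
        System.out.println(render(seq, scheme(".")));  // same symbols, gap shown as '.'
        System.out.println(crossProductName(Arrays.asList(A, G), scheme("-")));
    }
}
```

Nothing about the symbols themselves changes between the two calls to render; only the table passed in does.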
Right now, Symbols do still have a `name' property, but this
is there for internal and debugging use. For public display,
you should always use a SymbolTokenization. Some people have
suggested that even this property should go. I'm certainly open
to comments on this issue.
The resulting patch has turned out to be non-trivial -- quite
a lot of code has been touched. There will also need to be
some (although hopefully not too many) changes to application
code. However, I'll argue that the patch makes BioJava's Symbol
code a lot stronger and more robust to future developments. I'd
thus like to see it applied. There are a number of options:
- Apply it straight away, and include all these changes in
BioJava 1.2
- Release 1.2 in the reasonably near future, and apply
this patch in the next development series.
- Just ditch it and keep the status quo (although there
definitely will have to be at least some tidying of
Alphabet creation in the not-too-distant future).
- Something else?
This is an issue which will affect a lot of people, so I'd
like to hear as many views as possible.
You can download the current patched source tree from:
http://www.biojava.org/download/source/biojava-symtoke-20011016.tar.gz
I've ported the existing JUnit test suite across to the
new API, and added a few extra tests for functionality
which wasn't being exercised by existing tests. Everything
is passing cleanly (but more test cases are always welcome...).
There are a few issues which should be resolved before
checking this code into the main tree:
- There's still some cruft left over from the old token
system. This should be tracked down and removed.
- There is some use of a temporary method
AlphabetManager.parse(SymbolTokenization, String). These
calls should probably be replaced by
new SimpleSymbolList(SymbolTokenization, String);
- AllSymbolsAlphabet has been removed. I know some
people (Matthew?) find this very useful, so I should
probably write a replacement.
Anyway, test, hack, and let me know what you think!
Thomas