[Biojava-l] SymbolTokenization patch

Tue, 16 Oct 2001 14:36:00 +0100

Hi...

We've been seeing a number of problems recently relating
to the construction of alphabets, and in particular binding
names to ambiguous and gap symbols.  I'd like to propose
a patch which addresses a number of issues in this area,
and should leave us with a more solid Symbol/Alphabet
infrastructure.

The idea is:

  - Remove the `token' (single-character name) property from
    the Symbol interface.  This was problematic since it was
    undefined in many cases (especially cross-product alphabets
    where there might well be more symbols than ASCII characters).

  - Replace the old SymbolParser (a one-way string -> symbol map)
    with a new interface, SymbolTokenization (a two-way
    string <--> symbol map).
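
To make the shape of this change concrete, here is a minimal,
self-contained sketch of a two-way map.  The names Sym, Tokenization
and MapTokenization are made up for illustration and are NOT the
interfaces in the patch; the only point is that the symbol itself
carries no token, and the string <--> symbol binding lives in a
separate object.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a Symbol: note there is no token property.
final class Sym {
    private final String name;   // internal/debugging name only
    Sym(String name) { this.name = name; }
    String getName() { return name; }
}

// Hypothetical two-way string <--> symbol map (not the real interface).
interface Tokenization {
    Sym parseToken(String token);     // string -> symbol
    String tokenizeSymbol(Sym sym);   // symbol -> string
}

// Simple table-driven implementation of the sketch interface.
final class MapTokenization implements Tokenization {
    private final Map<String, Sym> byToken = new HashMap<>();
    private final Map<Sym, String> bySymbol = new HashMap<>();

    // Register one binding in both directions.
    void bind(Sym sym, String token) {
        byToken.put(token, sym);
        bySymbol.put(sym, token);
    }

    public Sym parseToken(String token) {
        Sym s = byToken.get(token);
        if (s == null) {
            throw new IllegalArgumentException("Unknown token: " + token);
        }
        return s;
    }

    public String tokenizeSymbol(Sym sym) {
        String t = bySymbol.get(sym);
        if (t == null) {
            throw new IllegalArgumentException("No token for symbol: " + sym.getName());
        }
        return t;
    }
}
```

A symbol with no binding in a given map simply has no token there,
which is the case the old per-Symbol property could not express
cleanly.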

In doing so, all sorts of cruft dies.  In particular:

  - Alphabet creation can be simplified.

  - We can get back to the idealized situation whereby alphabets
    IMPLICITLY contain all possible ambiguity symbols (including
    the gap symbol), and no longer have to pre-seed ambiguity
    symbols where they have tokens associated with them (e.g.
    all DNA ambiguities, and a sub-set of protein ambiguities).

  - We are able to handle multiple naming schemes (for instance,
    arguments about what character to use for the gap symbol)
    in a clean, transparent way.  The underlying Symbols can remain
    the same whichever naming scheme (==SymbolTokenization) you
    use.

  - All the conventions for naming of cross-product symbols
    are neatly packaged together in CrossProductTokenization.java.
    It's easy to add alternative conventions in the future if
    anyone needs to.
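
As a rough illustration of the multiple-naming-schemes point, the
sketch below binds the same symbol objects into two schemes that
differ only in the gap character.  PlainSymbol and NamingScheme are
hypothetical names chosen for this example, not classes from the
patch, and single-character tokens are assumed purely to keep it
short.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical symbol type with only an internal name.
final class PlainSymbol {
    final String name;
    PlainSymbol(String name) { this.name = name; }
}

// Hypothetical naming scheme: one character per symbol, both directions.
final class NamingScheme {
    private final Map<Character, PlainSymbol> toSymbol = new HashMap<>();
    private final Map<PlainSymbol, Character> toToken = new HashMap<>();

    // Bind a symbol to a character; returns this so calls can be chained.
    NamingScheme bind(PlainSymbol s, char token) {
        toSymbol.put(token, s);
        toToken.put(s, token);
        return this;
    }

    // Render a run of symbols using this scheme's characters.
    String tokenize(PlainSymbol[] symbols) {
        StringBuilder sb = new StringBuilder();
        for (PlainSymbol s : symbols) {
            Character c = toToken.get(s);
            if (c == null) {
                throw new IllegalArgumentException("No token for " + s.name);
            }
            sb.append(c.charValue());
        }
        return sb.toString();
    }

    // Parse a string back into the shared symbol objects.
    PlainSymbol[] parse(String text) {
        PlainSymbol[] out = new PlainSymbol[text.length()];
        for (int i = 0; i < text.length(); i++) {
            PlainSymbol s = toSymbol.get(text.charAt(i));
            if (s == null) {
                throw new IllegalArgumentException("Unknown token: " + text.charAt(i));
            }
            out[i] = s;
        }
        return out;
    }
}
```

With a gap symbol bound to '-' in one scheme and '.' in another, a
string parsed with the first scheme can be written back out with the
second, and the symbols in between are the same objects throughout.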

Right now, Symbols do still have a `name' property, but this
is there for internal and debugging use.  For public display,
you should always use a SymbolTokenization.  Some people have
suggested that even this property should go.  I'm certainly open
to comments on this issue.

The resulting patch has turned out to be non-trivial -- quite
a lot of code has been touched.  There will also need to be
some (although hopefully not too many) changes to application
code.  However, I'll argue that the patch makes BioJava's Symbol
code a lot stronger and more robust to future developments.  I'd
thus like to see it applied.  There are a number of options:

  - Apply it straight away, and include all these changes in
    BioJava 1.2.

  - Release 1.2 in the reasonably near future, and apply
    this patch in the next development series.

  - Just ditch it and keep the status quo (although there
    definitely will have to be at least some tidying of
    Alphabet creation in the not-too-distant future).

  - Something else?

This is an issue which will affect a lot of people, so I'd
like to hear as many views as possible.

You can download the current patched source tree from:

  http://www.biojava.org/download/source/biojava-symtoke-20011016.tar.gz

I've ported the existing JUnit test suite across to the
new API, and added a few extra tests for functionality
which wasn't being exercised by existing tests.  Everything
is passing cleanly (but more test cases are always welcome...).

There are a few issues which should be resolved before
checking this code into the main tree:

  - There's still some cruft left over from the old token
    system.  This should be tracked down and removed.

  - There is some use of a temporary method
    AlphabetManager.parse(SymbolTokenization, String).  These
    calls should probably be replaced by

         new SimpleSymbolList(SymbolTokenization, String);

  - AllSymbolsAlphabet has been removed.  I know some
    people (Matthew?) find this very useful, so I should
    probably write a replacement.

Anyway, test, hack, and let me know what you think!

    Thomas