[Biojava-dev] SymbolList tokenization

Thomas Down thomas at derkholm.net
Wed Aug 20 11:45:19 EDT 2003


Once upon a time, Francois Pepin wrote:
> Hi everyone,
> 
> Would anyone have problems with redefining a bit how tokenizers work? The
> current way is quite complicated if someone wants to work with a custom
> alphabet. Trying to tokenize a DNAxDNAxDNA SymbolList also fails because
> no tokenizer is defined for that alphabet.
>
> For the "token" tokenization, I think it would be more sensible to have
> the default ask the Symbol to see what their character token is. After
> all, if the Symbols are responsible for knowing their own name, they
> should also be responsible to know their own 1-letter code.
>
> The methods are there to create Symbols with a character token, but they're
> deprecated. I think that those methods should still be used. And then we
> could have a default "token" tokenization that just asks the symbols what
> their preferred token is.

Symbols knowing their default tokens is how things used to
work in BioJava <= 1.1 (that's why deprecated constructors,
etc., are there).  It never seemed to work very well...
Unless there is an extremely strong argument, I'd really
prefer not to go there ever again.

In particular, if you put a getToken() type method on Symbols,
then the expectation is that all Symbols will return something
useful.  But for many types of symbol (numbers, symbols from
large cross-product alphabets, etc.) this is nonsensical or
even impossible.
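
To put rough numbers on the cross-product case, here is a minimal
sketch in plain Java (no BioJava classes). The class and method names
are made up for illustration, and the component sizes (4 unambiguous
DNA symbols, 20 amino acids) and the 95 printable ASCII characters are
assumptions chosen for the example:

```java
// Hypothetical helper, not part of BioJava: counts the symbols in a
// cross-product alphabet and compares that with the pool of
// one-character tokens available in printable ASCII.
final class CrossProductSize {
    // Printable ASCII runs from 0x20 to 0x7E inclusive: 95 characters.
    static final int PRINTABLE_ASCII = 95;

    // The size of a cross-product alphabet is the product of the
    // sizes of its component alphabets.
    static long crossProductSize(int... componentSizes) {
        long size = 1;
        for (int n : componentSizes) {
            size = Math.multiplyExact(size, (long) n);
        }
        return size;
    }

    // True if every symbol could, in principle, get its own character.
    static boolean fitsInOneChar(long alphabetSize) {
        return alphabetSize <= PRINTABLE_ASCII;
    }

    public static void main(String[] args) {
        long codons = crossProductSize(4, 4, 4);   // DNA x DNA x DNA
        long pairs = crossProductSize(20, 20);     // protein x protein
        System.out.println("DNAxDNAxDNA: " + codons
                + " symbols, one char each possible: " + fitsInOneChar(codons));
        System.out.println("PROTEINxPROTEIN: " + pairs
                + " symbols, one char each possible: " + fitsInOneChar(pairs));
    }
}
```

Even where the count fits, there is no natural character for each
tuple, so any assignment would be arbitrary.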

I'd agree that setting up tokenizations isn't documented very
well.  I actually think that most people who want to work
with custom alphabets should be doing it via XML files
rather than programmatically, but this isn't documented
at all.  I'll do a BJIA tutorial for that.

In a nutshell, it's:

    <alphabet name="binary">
      <symbol name="true" />
      <symbol name="false" />
      <characterTokenization name="token" caseSensitive="false">
        <atomicMapping token="1">
          <symbolref name="true" />
        </atomicMapping>
        <atomicMapping token="0">
          <symbolref name="false" />
        </atomicMapping>
      </characterTokenization>
    </alphabet>

Then load the file with AlphabetManager.loadAlphabets.
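
To show what the XML above actually declares, here is a rough sketch
that reads the same document with the JDK's own DOM parser and builds
the token table. This is not BioJava's loader and makes no claim about
how AlphabetManager works internally; the class TokenTableSketch and
its methods are hypothetical, and it ignores the caseSensitive
attribute. The point is only that the character-to-symbol mapping
lives in the tokenization, outside the symbols themselves:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical illustration only; uses no BioJava classes.
final class TokenTableSketch {
    // The "binary" alphabet definition from the message above.
    static final String BINARY_XML =
          "<alphabet name=\"binary\">"
        + "<symbol name=\"true\" />"
        + "<symbol name=\"false\" />"
        + "<characterTokenization name=\"token\" caseSensitive=\"false\">"
        + "<atomicMapping token=\"1\"><symbolref name=\"true\" /></atomicMapping>"
        + "<atomicMapping token=\"0\"><symbolref name=\"false\" /></atomicMapping>"
        + "</characterTokenization>"
        + "</alphabet>";

    // Maps each token character to the name of the symbol it refers to.
    static Map<Character, String> readTokenTable(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<Character, String> table = new LinkedHashMap<>();
            NodeList mappings = doc.getElementsByTagName("atomicMapping");
            for (int i = 0; i < mappings.getLength(); i++) {
                Element mapping = (Element) mappings.item(i);
                Element ref = (Element) mapping.getElementsByTagName("symbolref").item(0);
                table.put(mapping.getAttribute("token").charAt(0), ref.getAttribute("name"));
            }
            return table;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not read alphabet XML", e);
        }
    }

    // Turns a string of tokens into symbol names using the table.
    static List<String> decode(Map<Character, String> table, String text) {
        List<String> names = new ArrayList<>();
        for (char c : text.toCharArray()) {
            String name = table.get(c);
            if (name == null) {
                throw new IllegalArgumentException("no symbol for token '" + c + "'");
            }
            names.add(name);
        }
        return names;
    }

    public static void main(String[] args) {
        Map<Character, String> table = readTokenTable(BINARY_XML);
        System.out.println(decode(table, "1001"));
    }
}
```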

How would you go about defining a tokenization for a cross-product
alphabet?  I don't think shifting knowledge of tokens back to
the Symbols is going to help here at all.

    Thomas.

