[Biojava-dev] SymbolList tokenization

Francois Pepin fpepin at cs.mcgill.ca
Wed Aug 20 14:45:12 EDT 2003


The problem is that toString and seqString cannot work properly with many
alphabets (cross-product alphabets are a good example). Maybe there is a
better way of making them work.

Matthew's suggestion is to try the "token" tokenization first and, if that
fails, fall back to the "name" tokenization with a space separator.
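
To make the fallback concrete, here is a minimal sketch in plain Java. It does
not use the BioJava classes: the class name, the render method, and the idea
of modelling symbols as name strings with an optional one-character token are
all assumptions made for illustration, and the cross-product names are only
examples.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model, not the BioJava API: a symbol is just its name, and
// tokenByName holds a one-character token only for symbols that have one.
final class FallbackTokenization {

    static String render(List<String> symbolNames,
                         Map<String, Character> tokenByName) {
        StringBuilder tokens = new StringBuilder();
        for (String name : symbolNames) {
            Character token = tokenByName.get(name);
            if (token == null) {
                // "token" tokenization fails for this symbol, so fall back
                // to names separated by spaces for the whole list.
                return String.join(" ", symbolNames);
            }
            tokens.append(token.charValue());
        }
        return tokens.toString();
    }

    public static void main(String[] args) {
        Map<String, Character> dna = new HashMap<>();
        dna.put("adenine", 'a');
        dna.put("guanine", 'g');

        // Every symbol has a token: one character per symbol.
        System.out.println(render(Arrays.asList("guanine", "adenine"), dna));

        // Cross-product symbols have no token here: names are used instead.
        System.out.println(render(
                Arrays.asList("(adenine guanine)", "(guanine guanine)"), dna));
    }
}
```

The fallback applies to the whole list rather than per symbol, so the output
never mixes one-letter tokens with space-separated names.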

My case is a bit specific because I need to modify an alphabet by adding a
separator at the end (for suffix tree building). Going into the XML file
is an extremely ugly way of doing it. I managed to do it by grabbing the
old tokenization and adding a couple of bindings, but I wouldn't mind
having a more elegant way of doing it.
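
The "grab the old tokenization and add a binding" approach can be sketched as
follows. This is plain Java with hypothetical names (TerminatedTokenization,
withTerminator), not the BioJava tokenization classes; it assumes a
tokenization can be modelled as a map from character token to symbol name, and
that '$' is the separator chosen for the suffix tree.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model, not the BioJava API: a tokenization is a map from
// one-character token to symbol name.
final class TerminatedTokenization {

    // Returns a copy of the base bindings with one extra binding for the
    // terminator symbol. The base map itself is left untouched.
    static Map<Character, String> withTerminator(Map<Character, String> base,
                                                 char token,
                                                 String symbolName) {
        if (base.containsKey(token)) {
            throw new IllegalArgumentException("token already bound: " + token);
        }
        Map<Character, String> extended = new LinkedHashMap<>(base);
        extended.put(token, symbolName);
        return extended;
    }

    public static void main(String[] args) {
        Map<Character, String> dna = new LinkedHashMap<>();
        dna.put('a', "adenine");
        dna.put('c', "cytosine");
        dna.put('g', "guanine");
        dna.put('t', "thymine");

        // '$' and "terminator" are illustrative choices for the separator.
        Map<Character, String> extended = withTerminator(dna, '$', "terminator");
        System.out.println(extended.keySet());
    }
}
```

Copying the bindings instead of mutating them keeps the standard alphabet's
tokenization unchanged for other code that relies on it.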

Although the problem is solvable in my specific case, I think that Symbols
should know about their token just as they know about their name. Not
every Symbol needs to have one, but then we should have a way to fall back.
Right now we have a bit of the worst of both worlds, because we can't easily
specify it, and some very basic code (seqString and toString, for example)
expects it to be there and work.

Although the ability to create new Alphabets on the fly and do funky
things with them isn't often used, I don't think that someone should have
to go and specify a new Tokenization manually every time it happens. Using
the XML file is nice for standard languages, but I don't think it should
be the only elegant way of doing it.

Francois

Once upon a time, Francois Pepin wrote:
> Hi everyone,
>
> Would anyone have problems with redefining a bit how tokenizers work? The
> current way is quite complicated if someone wants to work with a custom
> alphabet. Trying to tokenize a DNAxDNAxDNA SymbolList also fails because
> no tokenizer is defined for that alphabet.
>
> For the "token" tokenization, I think it would be more sensible to have
> the default ask the Symbol to see what their character token is. After
> all, if the Symbols are responsible for knowing their own name, they
> should also be responsible to know their own 1-letter code.
>
> The methods are there to create Symbols with a character token, but they're
> deprecated. I think that those methods should still be used. And then we
> could have a default "token" tokenization that just asks the symbols what
> their preferred token is.

Symbols knowing their default tokens is how things used to
work in BioJava <= 1.1 (that's why deprecated constructors,
etc., are there).  It never seemed to work very well...
Unless there are extremely strong arguments, I'd really
prefer not to go there ever again.

In particular, if you put a getToken() type method on Symbols,
then the expectation is that all Symbols will return something
useful.  But for many types of symbol (numbers, symbols from
large cross-product alphabets, etc.) this is nonsensical or
even impossible.

I'd agree that setting up tokenizations isn't documented very
well.  I actually think that most people who want to work
with custom alphabets should be doing it via XML files
rather than programmatically, but this isn't documented
at all.  I'll do a BJIA tutorial for that.

In a nutshell, it's:

    <alphabet name="binary">
      <symbol name="true" />
      <symbol name="false" />
      <characterTokenization name="token" caseSensitive="false">
        <atomicMapping token="1">
          <symbolref name="true" />
        </atomicMapping>
        <atomicMapping token="0">
          <symbolref name="false" />
        </atomicMapping>
      </characterTokenization>
    </alphabet>

Then use AlphabetManager.loadAlphabets.

How would you go about defining a tokenization for a cross-product
alphabet?  I don't think shifting knowledge of tokens back to
the Symbols is going to help here at all.

    Thomas.




