[Biojava-dev] SymbolList tokenization

Thomas Down thomas at derkholm.net
Wed Aug 20 18:46:35 EDT 2003


Once upon a time, Francois Pepin wrote:
> The problem is that toString and seqString have no way of working properly
> with a lot of the alphabets. Maybe there'd be a better way of making them work
> (cross-product alphabets are a good example).

That's really the sort of thing that
SymbolTokenization.tokenizeSymbolList is intended for.

> Matthew's suggestion is to try the "token" tokenization first and, if
> it fails, to fall back to name tokenization with a space separator.

That's definitely a sensible change to make.
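
To make the proposed fallback concrete, here is a minimal sketch of the idea. It is a toy model, not BioJava code: the `Sym` class and the `tokenizeByChar`, `tokenizeByName`, and `seqString` methods are hypothetical stand-ins, assuming only that a symbol always has a name and may or may not have a single-character token.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Toy model only: these types are hypothetical and are NOT the BioJava API.
class TokenizationFallbackSketch {

    /** A symbol with a name and, optionally, a single-character token. */
    static final class Sym {
        final String name;
        final Character token; // null when the symbol has no one-character form

        Sym(String name, Character token) {
            this.name = name;
            this.token = token;
        }
    }

    /** One character per symbol; empty if any symbol lacks a token. */
    static Optional<String> tokenizeByChar(List<Sym> symbols) {
        StringBuilder sb = new StringBuilder();
        for (Sym s : symbols) {
            if (s.token == null) {
                return Optional.empty();
            }
            sb.append(s.token.charValue());
        }
        return Optional.of(sb.toString());
    }

    /** Always succeeds: symbol names joined by single spaces. */
    static String tokenizeByName(List<Sym> symbols) {
        List<String> names = new ArrayList<>();
        for (Sym s : symbols) {
            names.add(s.name);
        }
        return String.join(" ", names);
    }

    /** Prefer the compact character form, fall back to names. */
    static String seqString(List<Sym> symbols) {
        return tokenizeByChar(symbols).orElseGet(() -> tokenizeByName(symbols));
    }

    public static void main(String[] args) {
        Sym a = new Sym("adenine", 'a');
        Sym c = new Sym("cytosine", 'c');
        // A cross-product style symbol with no single-character token.
        Sym ac = new Sym("(adenine cytosine)", null);

        System.out.println(seqString(Arrays.asList(a, c)));
        System.out.println(seqString(Arrays.asList(a, ac)));
    }
}
```

The first call can use the character form; the second contains a symbol without a token, so the whole list is rendered by name instead of failing.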

> My case is a bit specific because I need to modify an alphabet by adding a
> separator at the end (for suffix tree building). Going into the XML file
> is an extremely ugly way of doing it. I managed to do it by grabbing the
> old tokenization and adding a couple of bindings, but I wouldn't mind
> having a more elegant way of doing it.
> 
> Although the problem is solvable in my specific case, I think that Symbols
> should know about their token just as they know about their name. Not
> every Symbol needs to have one, but then we should have a way to fall back.
> Right now we have a bit of the worst of both worlds, because we can't easily
> specify it, and some very basic code (seqString and toString for example)
> expects it to be there and work.

Hmmm, if anything, I'd actually argue for fixing things the other
way round -- removing getName from Symbols and having them
as entirely opaque objects, with the mapping to and from
textual representations being handled entirely by
SymbolTokenizations.  In practice, though, a `name' is
a sufficiently general concept that it's possible to give one
to most interesting symbols, and it's really helpful to have
it there for debugging/quick-and-dirty stuff.

The seqString documentation should certainly point to
tokenizeSymbolList, though.

> Although the ability to create new Alphabets on the fly and do funky
> things with them isn't often used, I don't think that someone should have
> to go and specify a new Tokenization manually every time it happens. Using
> the XML file is nice for standard languages, but I don't think it should
> be the only elegant way of doing it.

Hmmm, if your concern is primarily about the ease of setting up
a basic alphabet, would some convenience methods suffice?  For
example:

     /**
      * Create and add a new symbol to the specified alphabet,
      * adding a mapping to the default single-character
      * SymbolTokenization.
      */
     public static void createSymbolInAlphabet(
         SimpleAlphabet alpha,
         String name,
         char defaultSingleCharToken
     );

If we got into this kind of thing, there's actually a whole
lot which could be usefully streamlined about alphabet
creation.  Another good one would be a copy-constructor to
make a new SimpleAlphabet from an existing FiniteAlphabet,
which would massively simplify cases where you want to add
a few symbols to a built-in alphabet.
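
As a rough illustration of how these two conveniences might fit together, here is a toy sketch. `ToyAlphabet` and all of its methods are hypothetical and deliberately independent of BioJava; it assumes an alphabet is no more than a set of named symbols, each bound to one character.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model only: ToyAlphabet is hypothetical and is NOT BioJava's SimpleAlphabet.
class ToyAlphabet {
    // Symbol name -> single-character token, in insertion order.
    private final Map<String, Character> tokenByName = new LinkedHashMap<>();

    ToyAlphabet() {
    }

    /** Copy-constructor: start from the symbols of an existing alphabet. */
    ToyAlphabet(ToyAlphabet other) {
        tokenByName.putAll(other.tokenByName);
    }

    /** Add a symbol and bind its single-character token in one step. */
    void createSymbol(String name, char token) {
        if (tokenByName.containsKey(name)) {
            throw new IllegalArgumentException("duplicate symbol: " + name);
        }
        if (tokenByName.containsValue(token)) {
            throw new IllegalArgumentException("token already bound: " + token);
        }
        tokenByName.put(name, token);
    }

    int size() {
        return tokenByName.size();
    }

    boolean contains(String name) {
        return tokenByName.containsKey(name);
    }

    char tokenFor(String name) {
        Character token = tokenByName.get(name);
        if (token == null) {
            throw new IllegalArgumentException("unknown symbol: " + name);
        }
        return token;
    }

    public static void main(String[] args) {
        ToyAlphabet dna = new ToyAlphabet();
        dna.createSymbol("adenine", 'a');
        dna.createSymbol("cytosine", 'c');
        dna.createSymbol("guanine", 'g');
        dna.createSymbol("thymine", 't');

        // Extend a copy with a terminator, e.g. for suffix tree building,
        // leaving the original alphabet untouched.
        ToyAlphabet withTerminator = new ToyAlphabet(dna);
        withTerminator.createSymbol("terminator", '$');

        System.out.println(dna.size() + " symbols, " + withTerminator.size() + " with terminator");
    }
}
```

With something along these lines, the suffix-tree case reduces to one copy and one added symbol, with no need to edit the XML file.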

[In general, I think BioJava could benefit from copy-constructors
in quite a few places]

Is there anything you want to do which really *needs* symbols
to know their token?

    Thomas.

