[Biojava-dev] SymbolList tokenization
Thomas Down
thomas at derkholm.net
Wed Aug 20 18:46:35 EDT 2003
Once upon a time, Francois Pepin wrote:
> The problem is that toString and seqString have no way of working properly
> with a lot of the alphabets. Maybe there'd be a better way of making them work
> (cross-product alphabets are a good example).
That's really the sort of thing that
SymbolTokenization.tokenizeSymbolList is intended for.
> Matthew's suggestion is to maybe check for the token tokenization and if
> it fails to go with name tokenization with a space separator.
That's definitely a sensible change to make.
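
A rough, self-contained sketch of that fallback, just to pin down the behaviour: try a one-character-per-symbol rendering first, and if any symbol lacks a token, switch to space-separated names. The classes below (TokenSym, FallbackRenderer) are hypothetical stand-ins for illustration only, not the BioJava API.

```java
import java.util.List;
import java.util.StringJoiner;

// Hypothetical stand-in for a symbol; NOT a BioJava class.
final class TokenSym {
    final String name;
    final Character token; // null when the symbol has no single-character token

    TokenSym(String name, Character token) {
        this.name = name;
        this.token = token;
    }
}

// Hypothetical helper showing the "token first, then name" fallback.
final class FallbackRenderer {

    /** One character per symbol if every symbol has a token, else names. */
    static String render(List<TokenSym> symbols) {
        StringBuilder chars = new StringBuilder();
        for (TokenSym s : symbols) {
            if (s.token == null) {
                // Token rendering is impossible for this list: fall back.
                return renderNames(symbols);
            }
            chars.append(s.token.charValue());
        }
        return chars.toString();
    }

    /** Name rendering with a single space as the separator. */
    static String renderNames(List<TokenSym> symbols) {
        StringJoiner joined = new StringJoiner(" ");
        for (TokenSym s : symbols) {
            joined.add(s.name);
        }
        return joined.toString();
    }
}
```

The fallback applies to the whole list rather than per symbol, so the output never mixes single characters with names.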
> My case is a bit specific because I need to modify an alphabet by adding a
> separator at the end (for suffix tree building). Going into the XML file
> is an extremely ugly way of doing it. I managed to do it by grabbing the
> old tokenization and adding a couple of bindings, but I wouldn't mind
> having a more elegant way of doing it.
>
> Although the problem is solvable in my specific case, I think that Symbols
> should know about their token just as they know about their name. Not
> every Symbol needs to have one, but then we should have a way to fall back.
> Right now we have a bit of the worst of both worlds, because we can't easily
> specify it, and some very basic code (seqString and toString for example)
> expect it to be there and work.
Hmmm, if anything, I'd actually argue for fixing things the other
way round -- removing getName from Symbols and having them
as entirely opaque objects, with the mapping to and from
textual representations being handled entirely by
SymbolTokenizations. In practice, though, a `name' is
a sufficiently general concept that it's possible to give one
to most interesting symbols, and it's really helpful to have
it there for debugging/quick-and-dirty stuff.
The seqString documentation should certainly point to
tokenizeSymbolList, though.
> Although the ability to create new Alphabets on the fly and do funky
> things with them isn't often used, I don't think that someone should have
> to go and specify a new Tokenization manually every time it happens. Using
> the XML file is nice for standard languages, but I don't think it should
> be the only elegant way of doing it.
Hmmm, if your concern is primarily about the ease of setting up
a basic alphabet, would some convenience methods suffice? For
example:
    /**
     * Create and add a new symbol to the specified alphabet,
     * adding a mapping to the default single-character
     * SymbolTokenization.
     */
    public static void createSymbolInAlphabet(
        SimpleAlphabet alpha,
        String name,
        char defaultSingleCharToken
    );
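
To make the intent of that signature concrete, here is a minimal sketch against a toy model. MiniSymbol, MiniAlphabet and AlphabetHelpers are hypothetical and exist only for this example; a real version would work on SimpleAlphabet and its character tokenization, and the duplicate checks are my assumption about sensible behaviour.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical opaque symbol: identity-based, carries only a name.
final class MiniSymbol {
    final String name;

    MiniSymbol(String name) {
        this.name = name;
    }
}

// Hypothetical alphabet holding its symbols plus a char <-> symbol mapping.
final class MiniAlphabet {
    final Map<String, MiniSymbol> symbolsByName = new LinkedHashMap<>();
    final Map<Character, MiniSymbol> symbolsByToken = new HashMap<>();
    final Map<MiniSymbol, Character> tokensBySymbol = new HashMap<>();
}

final class AlphabetHelpers {

    /** Create a symbol, add it to the alphabet, and bind its token in one step. */
    static void createSymbolInAlphabet(
            MiniAlphabet alpha, String name, char defaultSingleCharToken) {
        if (alpha.symbolsByName.containsKey(name)) {
            throw new IllegalArgumentException("Duplicate symbol name: " + name);
        }
        if (alpha.symbolsByToken.containsKey(defaultSingleCharToken)) {
            throw new IllegalArgumentException(
                    "Token already bound: " + defaultSingleCharToken);
        }
        MiniSymbol sym = new MiniSymbol(name);
        alpha.symbolsByName.put(name, sym);
        // Keep both directions so parsing and printing stay in sync.
        alpha.symbolsByToken.put(defaultSingleCharToken, sym);
        alpha.tokensBySymbol.put(sym, defaultSingleCharToken);
    }
}
```

Note that the symbol itself still knows nothing about its token; the mapping lives with the alphabet, which is the split I'd prefer to keep.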
If we got into this kind of thing, there's actually a whole
lot which could be usefully streamlined about alphabet
creation. Another good one would be a copy-constructor to
make a new SimpleAlphabet from an existing FiniteAlphabet,
which would massively simplify cases where you want to add
a few symbols to a built-in alphabet.
[In general, I think BioJava could benefit from copy-constructors
in quite a few places]
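
For the "built-in alphabet plus a separator" case, the copy-constructor idea could look roughly like the following. This is a toy model only: CopyableAlphabet is a hypothetical class, not SimpleAlphabet, and symbols are reduced to a token-to-name map to keep the sketch short.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical alphabet with a copy-constructor; NOT a BioJava class.
final class CopyableAlphabet {
    private final String name;
    private final Map<Character, String> namesByToken = new LinkedHashMap<>();

    CopyableAlphabet(String name) {
        this.name = name;
    }

    /** Copy-constructor: a new, independent alphabet with the source's symbols. */
    CopyableAlphabet(String name, CopyableAlphabet source) {
        this(name);
        namesByToken.putAll(source.namesByToken);
    }

    void addSymbol(String symbolName, char token) {
        if (namesByToken.containsKey(token)) {
            throw new IllegalArgumentException("Token already bound: " + token);
        }
        namesByToken.put(token, symbolName);
    }

    boolean containsToken(char token) {
        return namesByToken.containsKey(token);
    }

    int size() {
        return namesByToken.size();
    }

    String getName() {
        return name;
    }
}
```

Usage would then be a two-liner: copy the DNA alphabet, add the terminator symbol to the copy, and the original stays untouched.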
Is there anything you want to do which really *needs* symbols
to know their token?
Thomas.