[Biojava-dev] SymbolList tokenization

Francois Pepin frpepin at attglobal.net
Thu Aug 21 00:32:45 EDT 2003


As long as things behave in a sensible manner, I don't feel that
strongly about it.

I think it would have been easier to let each Symbol take care of
itself, rather than having the machinery around it think about
everything. After all, this is how things are handled in the XML file,
so I'm probably not the only person thinking that way. One way might be
to use the same machinery to define beefed-up Symbols that know about
everything, and then have the Alphabet created around a set of dumb
Symbols.

The solutions that you are offering would do the trick quite well for my
case. I basically ended up implementing code to do it (putting it in a
method would be quite easy). I'll add them tomorrow.

I'll go and modify the seqString code as well. First it will try to use
the "token" Tokenization, and if there is no such Tokenization or if
it's missing some Symbols, it'll fall back to a name Tokenization.
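The fallback described above can be sketched roughly as follows. This is a stand-alone illustration, not BioJava code: symbols are represented by their names, and the hypothetical `tokenTable` map stands in for a "token" Tokenization. The class and method names are made up for the example.

```java
import java.util.List;
import java.util.Map;

/** Stand-alone sketch of the proposed fallback; these are NOT BioJava types. */
final class SeqStringFallbackSketch {

    /**
     * Renders a sequence given as a list of symbol names.
     *
     * @param symbolNames the sequence, one symbol name per position
     * @param tokenTable  hypothetical stand-in for a "token" Tokenization
     *                    (symbol name to single character); may be null
     */
    static String render(List<String> symbolNames, Map<String, Character> tokenTable) {
        // Use single-character tokens only if every symbol in the sequence has one.
        if (tokenTable != null && tokenTable.keySet().containsAll(symbolNames)) {
            StringBuilder sb = new StringBuilder(symbolNames.size());
            for (String name : symbolNames) {
                sb.append(tokenTable.get(name).charValue());
            }
            return sb.toString();
        }
        // Otherwise fall back to symbol names separated by a space.
        return String.join(" ", symbolNames);
    }
}
```

Checking the whole sequence before emitting anything avoids producing a string that mixes tokens and names, at the cost of one extra pass over the symbols.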

Out of curiosity, is there a built-in way to get all of the Symbols that
actually exist in a SymbolList (iterating through them and pitching them
in a Set would do the trick pretty easily for short sequences)? It might
make things easier for some manipulations (especially with large
Alphabets), but it could also be inconvenient to depend on them for long
sequences.
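For reference, the iterate-and-collect approach mentioned above might look like the sketch below. It is written against a plain `Iterable` rather than any BioJava type, and the class and method names are hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Stand-alone sketch; works on any Iterable, not specifically on a SymbolList. */
final class DistinctSymbolsSketch {

    /** Returns the distinct elements of the sequence, in order of first appearance. */
    static <S> Set<S> distinctSymbols(Iterable<S> symbols) {
        Set<S> seen = new LinkedHashSet<>();
        for (S symbol : symbols) {
            seen.add(symbol); // duplicates are ignored by the Set
        }
        return seen;
    }
}
```

This is a single pass over the sequence, so the cost grows linearly with sequence length, which is the concern raised above for long sequences. Memory use depends only on the number of distinct symbols.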

Francois

-----Original Message-----
From: biojava-dev-bounces at portal.open-bio.org
[mailto:biojava-dev-bounces at portal.open-bio.org] On Behalf Of Thomas
Down
Sent: August 20, 2003 18:47
To: Francois Pepin
Cc: biojava-dev at biojava.org
Subject: Re: [Biojava-dev] SymbolList tokenization


Once upon a time, Francois Pepin wrote:
> The problem is that toString and seqString have no way of working
> properly with a lot of the alphabets. Maybe there'd be a better way of
> making them work (cross-product alphabets are a good example).

That's really the sort of thing that
SymbolTokenization.tokenizeSymbolList is intended for.

> Matthew's suggestion is to maybe check for the token tokenization and 
> if it fails to go with name tokenization with a space separator.

That's definitely a sensible change to make.

> My case is a bit specific because I need to modify an alphabet by
> adding a separator at the end (for suffix tree building). Going into
> the XML file is an extremely ugly way of doing it. I managed to do it
> by grabbing the old tokenization and adding a couple of bindings, but I
> wouldn't mind having a more elegant way of doing it.
> 
> Although the problem is solvable in my specific case, I think that
> Symbols should know about their token just as they know about their
> name. Not every Symbol needs to have one, but then we should have a way
> to fall back. Right now we have a bit of the worst of both worlds,
> because we can't easily specify it, and some very basic code
> (seqString and toString for example) expects it to be there and work.

Hmmm, if anything, I'd actually argue for fixing things the other way
round -- removing getName from Symbols and having them as entirely
opaque objects, with the mapping to and from textual representations
being handled entirely by SymbolTokenizations.  In practice, though, a
`name' is a sufficiently general concept that it's possible to give one
to most interesting symbols, and it's really helpful to have it there
for debugging/quick-and-dirty stuff.

The seqString documentation should certainly point to
tokenizeSymbolList, though.

> Although the ability to create new Alphabets on the fly and do funky 
> things with them isn't often used, I don't think that someone should 
> have to go and specify a new Tokenization manually every time it 
> happens. Using the XML file is nice for standard languages, but I 
> don't think it should be the only elegant way of doing it.

Hmmm, if your concern is primarily about the ease of setting up a basic
alphabet, would some convenience methods suffice?  For
example:

     /**
      * Create and add a new symbol to the specified alphabet,
      * adding a mapping to the default single-character
      * SymbolTokenization.
      */

     public static void createSymbolInAlphabet(
         SimpleAlphabet alpha,
         String name,
         char defaultSingleCharToken
     );

If we got into this kind of thing, there's actually a whole
lot which could be usefully streamlined about alphabet creation.
Another good one would be a copy-constructor to make a new
SimpleAlphabet from an existing FiniteAlphabet, which would massively
simplify cases where you want to add a few symbols to a built-in
alphabet.

[In general, I think BioJava could benefit from copy-constructors in
quite a few places]
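To make the two proposals above concrete, here is a toy sketch of a copy-constructor combined with a createSymbolInAlphabet-style helper. `ToyAlphabet` is a hypothetical stand-in that maps symbol names to single-character tokens. It is not SimpleAlphabet or FiniteAlphabet, and nothing here reflects how BioJava would actually implement either idea.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy stand-in for a finite alphabet; NOT a BioJava class. */
final class ToyAlphabet {
    // Symbol name -> single-character token, kept in insertion order.
    private final Map<String, Character> tokenByName = new LinkedHashMap<>();

    ToyAlphabet() {}

    /** Copy-constructor: the new alphabet starts with the source's symbols. */
    ToyAlphabet(ToyAlphabet source) {
        tokenByName.putAll(source.tokenByName);
    }

    /** Hypothetical analogue of the convenience method proposed above. */
    static void createSymbolInAlphabet(ToyAlphabet alpha, String name, char token) {
        if (alpha.tokenByName.containsKey(name)) {
            throw new IllegalArgumentException("Duplicate symbol name: " + name);
        }
        if (alpha.tokenByName.containsValue(token)) {
            throw new IllegalArgumentException("Token already in use: " + token);
        }
        alpha.tokenByName.put(name, token);
    }

    int size() {
        return tokenByName.size();
    }

    boolean contains(String name) {
        return tokenByName.containsKey(name);
    }

    char tokenFor(String name) {
        Character token = tokenByName.get(name);
        if (token == null) {
            throw new IllegalArgumentException("Unknown symbol: " + name);
        }
        return token;
    }
}
```

With something along these lines, the suffix-tree case from earlier in the thread would reduce to copying the built-in alphabet and adding one terminator symbol, leaving the original alphabet untouched.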

Is there anything you want to do which really *needs* symbols to know
their token?

    Thomas.


