[Biojava-dev] SymbolList tokenization

Schreiber, Mark mark.schreiber at agresearch.co.nz
Sun Aug 24 00:45:29 EDT 2003


Along these lines (well not really but close),
 
If we are looking at a bit of a symbol/alphabet overhaul, I have always been concerned that it is possible to register an Alphabet with the AlphabetManager under the same name as a currently registered one.
 
At the moment the only problem would be if someone maliciously registered an Alphabet called DNA which wasn't the DNA you would expect on a remote system (not all that likely, but possible with RMI). If copy constructors are used, though, this could become a real problem.
 
I would like to make two suggestions.
 
1) Copy constructors require a new registration name as one of the arguments.
     e.g. makeAlphabet(Alphabet alpha, String name)
 
2) AlphabetManager throws an exception when an attempt is made to register an alphabet with a name that is already in use. This would have to be an unchecked exception or an error, as the registerAlphabet method of AlphabetManager does not declare any exception. BioError may be appropriate, or we could extend RuntimeException to something like AlphabetManagerException and document that it can occur in rare situations.
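
To make the two suggestions concrete, here is a minimal, self-contained sketch. It is not the BioJava API: AlphabetRegistrySketch is a hypothetical stand-in for AlphabetManager, an alphabet is modelled as a plain set of symbol names, and AlphabetManagerException is only the name floated above, not an existing class. It also uses generics, so it assumes a newer Java than the current code base targets.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for AlphabetManager; not the real BioJava class.
public class AlphabetRegistrySketch {

    // Unchecked, so registerAlphabet needs no "throws" clause (suggestion 2).
    public static class AlphabetManagerException extends RuntimeException {
        public AlphabetManagerException(String message) {
            super(message);
        }
    }

    // An "alphabet" is reduced to a set of symbol names for illustration.
    private final Map<String, Set<String>> alphabetsByName = new HashMap<>();

    // Refuses to replace an alphabet that is already registered under this name.
    public synchronized void registerAlphabet(String name, Set<String> symbols) {
        if (alphabetsByName.containsKey(name)) {
            throw new AlphabetManagerException(
                "An alphabet named '" + name + "' is already registered");
        }
        alphabetsByName.put(name, symbols);
    }

    // Copy "constructor" that demands a new registration name (suggestion 1).
    public synchronized Set<String> makeAlphabet(Set<String> source, String newName) {
        Set<String> copy = new LinkedHashSet<>(source);
        registerAlphabet(newName, copy); // fails if newName is already taken
        return copy;
    }

    public synchronized Set<String> alphabetForName(String name) {
        return alphabetsByName.get(name);
    }

    public static void main(String[] args) {
        AlphabetRegistrySketch manager = new AlphabetRegistrySketch();
        Set<String> dna = new LinkedHashSet<>(Arrays.asList("a", "c", "g", "t"));
        manager.registerAlphabet("DNA", dna);

        // The copy gets its own name, so "DNA" still means the original.
        Set<String> extended = manager.makeAlphabet(dna, "DNA_WITH_TERMINATOR");
        extended.add("$");

        try {
            manager.registerAlphabet("DNA", extended);
        } catch (AlphabetManagerException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

The point of the design is that a lookup by name can never silently start returning a different alphabet: a copy has to be registered under its own name, and a clash fails loudly at registration time rather than surfacing later on a remote system.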
 
- Mark
 

	-----Original Message----- 
	From: Thomas Down [mailto:thomas at derkholm.net] 
	Sent: Thu 21/08/2003 10:46 a.m. 
	To: Francois Pepin 
	Cc: biojava-dev at biojava.org 
	Subject: Re: [Biojava-dev] SymbolList tokenization
	
	

	Once upon a time, Francois Pepin wrote:
	> The problem is that toString and seqString have no way of working properly
	> with a lot of the alphabets. Maybe there'd be a better way of making them work
	> (cross-product alphabets are a good example).
	
	That's really the sort of thing that
	SymbolTokenization.tokenizeSymbolList is intended for.
	
	> Matthew's suggestion is to maybe check for the token tokenization and if
	> it fails to go with name tokenization with a space separator.
	
	That's definitely a sensible change to make.
	
	> My case is a bit specific because I need to modify an alphabet by adding a
	> separator at the end (for suffix tree building). Going into the XML file
	> is an extremely ugly way of doing it. I managed to do it by grabbing the
	> old tokenization and adding a couple of bindings, but I wouldn't mind
	> having a more elegant way of doing it.
	>
	> Although the problem is solvable in my specific case, I think that Symbols
	> should know about their token just as they know about their name. Not
	> every Symbol needs to have one, but then we should have a way to fall back.
	> Right now we have a bit of the worst of both worlds, because we can't easily
	> specify it, and some very basic code (seqString and toString for example)
	> expect it to be there and work.
	
	Hmmm, if anything, I'd actually argue for fixing things the other
	way round -- removing getName from Symbols and having them
	as entirely opaque objects, with the mapping to and from
	textual representations being handled entirely by
	SymbolTokenizations.  In practice, though, a `name' is
	a sufficiently general concept that it's possible to give one
	to most interesting symbols, and it's really helpful to have
	it there for debugging/quick-and-dirty stuff.
	
	The seqString documentation should certainly point to
	tokenizeSymbolList, though.
	
	> Although the ability to create new Alphabets on the fly and do funky
	> things with them isn't often used, I don't think that someone should have
	> to go and specify a new Tokenization manually every time it happens. Using
	> the XML file is nice for standard languages, but I don't think it should
	> be the only elegant way of doing it.
	
	Hmmm, if your concern is primarily about the ease of setting up
	a basic alphabet, would some convenience methods suffice?  For
	example:
	
	     /**
	      * Create and add a new symbol to the specified alphabet,
	      * adding a mapping to the default single-character
	      * SymbolTokenization.
	      */
	
	     public static void createSymbolInAlphabet(
	         SimpleAlphabet alpha,
	         String name,
	         char defaultSingleCharToken
	     );
	
	If we got into this kind of thing, there's actually a whole
	lot which could be usefully streamlined about alphabet
	creation.  Another good one would be a copy-constructor to
	make a new SimpleAlphabet from an existing FiniteAlphabet,
	which would massively simplify cases where you want to add
	a few symbols to a built-in alphabet.
	
	[In general, I think BioJava could benefit from copy-constructors
	in quite a few places]
	
	Is there anything you want to do which really *needs* symbols
	to know their token?
	
	    Thomas.
	




