[Biojava-l] Behavior of the createRegex() method (MotifTool class)

Keith James kdj@sanger.ac.uk
02 Dec 2002 10:44:27 +0000


>>>>> "Matthew" == Matthew Pocock <matthew_pocock@yahoo.co.uk> writes:

    Matthew> Well spotted Sylvain, Keith, there's a method in
    Matthew> AlphabetTools - getAllSymbols(). Feed it with the
    Matthew> matches() map of the symbol & cat together the tokens
    Matthew> from each of these.

I don't think this method is behaving as expected. Taking the
FiniteAlphabet returned by getMatches() for each of the following
Symbols, passing it to getAllSymbols() and tokenizing every Symbol in
the result gives:

a -> getMatches() -> getAllSymbols -> tokenize -> -a
c -> getMatches() -> getAllSymbols -> tokenize -> -c
g -> getMatches() -> getAllSymbols -> tokenize -> -g
t -> getMatches() -> getAllSymbols -> tokenize -> -t
n -> getMatches() -> getAllSymbols -> tokenize -> tnn-nannnngncnnn
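
One possible reading of these counts, offered as an assumption rather than confirmed behaviour: getAllSymbols() may be returning one Symbol for every subset of the matched symbols. For 'a' that would be 2 subsets (empty and {a}); for 'n' it would be 2^4 = 16 (1 empty, 4 singletons, 11 larger sets), which matches the single '-', the four a/c/g/t tokens and the eleven 'n' tokens above. The sketch below only reproduces that arithmetic in plain Java. It does not call BioJava, and the class name, method name and tokenizing rule are hypothetical.

```java
import java.util.Arrays;

public class SubsetTokenSketch {
    // Assumed rule (not taken from BioJava): one token per subset of the
    // matched symbols. '-' for the empty set, the symbol's own character
    // for a singleton, and 'n' for anything larger.
    static String sortedSubsetTokens(String atoms) {
        int n = atoms.length();
        char[] tokens = new char[1 << n];
        for (int mask = 0; mask < (1 << n); mask++) {
            int size = Integer.bitCount(mask);
            if (size == 0) {
                tokens[mask] = '-';
            } else if (size == 1) {
                tokens[mask] = atoms.charAt(Integer.numberOfTrailingZeros(mask));
            } else {
                tokens[mask] = 'n';
            }
        }
        // The real result is an unordered Set, so compare in sorted order.
        Arrays.sort(tokens);
        return new String(tokens);
    }

    public static void main(String[] args) {
        System.out.println(sortedSubsetTokens("a"));    // -a
        System.out.println(sortedSubsetTokens("acgt")); // -acgnnnnnnnnnnnt
    }
}
```

If this reading is right, the extra tokens come from the ambiguity symbols built over the matches alphabet, not from the atomic symbols themselves.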

The code I am using is as follows (motif is a SymbolList and i is the
position of the Symbol being processed).

Symbol sym = motif.symbolAt(i);
FiniteAlphabet ambiAlpha = (FiniteAlphabet) sym.getMatches();

Symbol [] ambiSyms = (Symbol [])
    AlphabetManager.getAllSymbols(ambiAlpha).toArray(new Symbol[0]);

// getAllSymbols returns a Set (i.e. unordered) so
// we convert to char array so we can sort tokens
char [] ambiChars = new char [ambiSyms.length];

for (int j = 0; j < ambiSyms.length; j++)
{
    ambiChars[j] =
        sToke.tokenizeSymbol(ambiSyms[j]).charAt(0);
}

Arrays.sort(ambiChars);
sb.append(ambiChars);

So the final character class for 'n' comes out as [-acgnnnnnnnnnnnt]
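
As a stopgap on the string side, the tokens could be de-duplicated and filtered before building the character class. This is a minimal sketch in plain Java and not a fix to getAllSymbols() itself. The helper name toCharClass and the idea of passing an explicit "keep" list of atomic tokens are assumptions made for illustration, as is the guess that [acgt] is the wanted result for 'n'.

```java
import java.util.TreeSet;

public class CharClassSketch {
    // Hypothetical helper: build a sorted, duplicate-free regex character
    // class from raw tokens, keeping only characters listed in 'keep'.
    static String toCharClass(String tokens, String keep) {
        TreeSet<Character> unique = new TreeSet<>();
        for (char c : tokens.toCharArray()) {
            if (keep.indexOf(c) >= 0) {
                unique.add(c); // TreeSet drops repeats and keeps order
            }
        }
        StringBuilder sb = new StringBuilder("[");
        for (char c : unique) {
            sb.append(c);
        }
        return sb.append(']').toString();
    }

    public static void main(String[] args) {
        // Tokens reported above for 'n'; "acgt" is the assumed keep list.
        System.out.println(toCharClass("tnn-nannnngncnnn", "acgt")); // [acgt]
    }
}
```

Filtering against a keep list also removes the '-' token, which would otherwise need escaping or careful placement inside a regex character class.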

-- 

- Keith James <kdj@sanger.ac.uk> bioinformatics programming support -
- Pathogen Sequencing Unit, The Wellcome Trust Sanger Institute, UK -