[Biojava-l] Ambiguity codes

Matthew Pocock mrp@sanger.ac.uk
Wed, 21 Jun 2000 17:39:25 +0100


Dear all,

I have been tying myself up in knots over ambiguity symbols. The
alphabet formulation seems to work very well in practice, until you
start trying to work with alphabets that contain ambiguity codes. Then,
because we currently add the ambiguity symbol to the alphabet,
everything gets very complicated. This is mildly annoying for DNA, and
really bad for proteins.

I am going to add the interface AmbiguitySymbol, which extends Symbol and
adds the method getMatchingAlphabet(). This returns an alphabet containing
the symbols that the ambiguity code matches. An alphabet's iterator
method will iterate over the 'core' symbols only, but the alphabet will
'contain' any AmbiguitySymbol whose matching symbols are all in that
alphabet. A side-benefit of this is that the 'gap' symbol becomes an
AmbiguitySymbol that matches nothing, so every alphabet contains it.

example:

FiniteAlphabet dna = DNATools.getAlphabet();
// ok - so we haven't written RNATools yet
FiniteAlphabet rna = RNATools.getAlphabet();
// get the 'n' ambiguity code for DNA
AmbiguitySymbol dna_n = DNATools.n();

// true, as n is {a, g, c, t} and dna contains each of these
dna.contains(dna_n);
// false, as n is {a, g, c, t} and rna has u, not t
rna.contains(dna_n);

// make a new ambiguity symbol
AmbiguitySymbol sym = AlphabetManager.instance().ambiguity(
  Arrays.asList(new Symbol[] {DNATools.a(), DNATools.g()})
);

dna.contains(sym); // true - it does contain a and g
rna.contains(sym); // also true

// make a new ambiguity symbol that adds c to sym
// (named sym2, since sym is already declared above)
AmbiguitySymbol sym2 = AlphabetManager.instance().ambiguity(
  Arrays.asList(new Symbol[] {sym, DNATools.c()})
);
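
To make the containment rule concrete, here is a small stand-alone
sketch that does not use BioJava at all. The names AmbiguitySketch, Sym,
Ambiguity and Alphabet are made up for illustration and are not the
proposed API. It also assumes that an ambiguity symbol built from another
ambiguity symbol is flattened down to core symbols; the proposal above
does not settle that point, so treat it as one possible choice.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

public class AmbiguitySketch {

    /** A 'core' symbol (hypothetical stand-in for Symbol). */
    static class Sym {
        final String name;
        Sym(String name) { this.name = name; }
        /** A core symbol matches only itself. */
        Set<Sym> matches() { return Collections.singleton(this); }
        @Override public String toString() { return name; }
    }

    /** An ambiguity symbol: stands for a set of core symbols. */
    static class Ambiguity extends Sym {
        private final Set<Sym> matching = new LinkedHashSet<>();
        Ambiguity(String name, Sym... parts) {
            super(name);
            // Assumption: nested ambiguity symbols are flattened,
            // so 'matching' only ever holds core symbols.
            for (Sym part : parts) {
                matching.addAll(part.matches());
            }
        }
        @Override Set<Sym> matches() {
            return Collections.unmodifiableSet(matching);
        }
    }

    /** A finite alphabet that stores core symbols only. */
    static class Alphabet {
        private final Set<Sym> core;
        Alphabet(Sym... core) {
            this.core = new LinkedHashSet<>(Arrays.asList(core));
        }
        /** Iteration sees the core symbols, never ambiguity symbols. */
        Set<Sym> symbols() { return Collections.unmodifiableSet(core); }
        /** True if every symbol that s matches is a core symbol here. */
        boolean contains(Sym s) { return core.containsAll(s.matches()); }
    }

    // Core symbols; a, g and c are shared between the two alphabets.
    static final Sym A = new Sym("a"), G = new Sym("g"), C = new Sym("c"),
                     T = new Sym("t"), U = new Sym("u");
    static final Alphabet DNA = new Alphabet(A, G, C, T);
    static final Alphabet RNA = new Alphabet(A, G, C, U);
    static final Ambiguity N = new Ambiguity("n", A, G, C, T);
    // The gap matches nothing, so every alphabet 'contains' it.
    static final Ambiguity GAP = new Ambiguity("-");

    public static void main(String[] args) {
        Ambiguity r = new Ambiguity("r", A, G); // {a, g}
        Ambiguity v = new Ambiguity("v", r, C); // r plus c

        System.out.println(DNA.contains(N));      // true
        System.out.println(RNA.contains(N));      // false: rna has u, not t
        System.out.println(RNA.contains(r));      // true
        System.out.println(v.matches());          // [a, g, c]
        System.out.println(RNA.contains(GAP));    // true
        System.out.println(DNA.symbols().size()); // 4
    }
}
```

Because contains() is just a subset test on the matched symbols, the gap
case needs no special handling: the empty set is a subset of every
alphabet. Adding ambiguity symbols never changes what the alphabet
iterates over.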

What do you think?
--
Joon: You're out of your tree
Sam:  It wasn't my tree
                                                 (Benny & Joon)