[Biojava-l] Suffix Trees

Matthew Pocock matthew_pocock@yahoo.co.uk
Thu, 18 Apr 2002 11:38:27 +0100


step1b@cyberspace.org wrote:
> Hi
> 
> Are there any examples which use SuffixTree class?
> does the implementation of SuffixTree class assume the size of finite alphabet
> is small(ie 4).?
> 
> thanks,
> st.
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l@biojava.org
> http://biojava.org/mailman/listinfo/biojava-l
> 

Hi St,

The suffix tree class is quite old now. It could probably do with an 
overhaul by someone who knows how. It assumes that the alphabet is 
small, and stores child references in an array. There are other possible 
implementations (e.g. nextSibling/children references), but this is the 
most compact for a small alphabet where any given level will 
more-than-likely be nearly fully populated. Navigation methods in 
SuffixTree will safely create nodes with a zero count if you walk to 
them. Methods in SuffixTree.SuffixNode will return nulls if you walk off 
the tree. I used it to build kernel functions over strings for 
support-vector machines, but it does eat memory (lengths > 16 just 
couldn't be done).

You use it something like:

// make a new random bit of DNA
List list = new ArrayList(length);
for(int i = 0; i < length; i++) {
    list.add(DNATools.forIndex((int) (4.0*Math.random())));
}
SymbolList symL = SimpleSymbolList(DNATools.getDNA(), list);

// make a suffix tree
SuffixTree sTree = new SuffixTree(DNATools.getDNA());

// add symL to the suffix tree, capping it at depth 15
sTree.addSymbols(symL, 15);

// read out the frequency of AGG
SuffixTree.SuffixNode node = sTree.getRoot();

// navigate tree using sTree methods - this is safe if the
// sub-string wasn't in the original seq
node = sTree.getChild(node, DNATools.a());
node = sTree.getChild(node, DNATools.g());
node = sTree.getChild(node, DNATools.g());

// print out its frequency
System.out.println("AGG frequency: " + node.getNumber());