[Biojava-l] Suffix Trees
Matthew Pocock
matthew_pocock@yahoo.co.uk
Thu, 18 Apr 2002 11:38:27 +0100
step1b@cyberspace.org wrote:
> Hi
>
> Are there any examples which use SuffixTree class?
> does the implementation of SuffixTree class assume the size of finite alphabet
> is small(ie 4).?
>
> thanks,
> st.
> _______________________________________________
> Biojava-l mailing list - Biojava-l@biojava.org
> http://biojava.org/mailman/listinfo/biojava-l
>
Hi St,
The suffix tree class is quite old now. It could probably do with an
overhaul by someone who knows how. It assumes that the alphabet is
small, and stores child references in an array. There are other possible
implementations (e.g. nextSibling/children references), but this is the
most compact for a small alphabet where any given level will
more-than-likely be nearly fully populated. Navigation methods in
SuffixTree will safely create nodes with a zero count if you walk to
them. Methods in SuffixTree.SuffixNode will return nulls if you walk off
the tree. I used it to build kernel functions over strings for
support-vector machines, but it does eat memory (lengths > 16 just
couldn't be done).
You use it something like:
// make a new random bit of DNA
List list = new ArrayList(length);
for(int i = 0; i < length; i++) {
list.add(DNATools.forIndex((int) (4.0*Math.random())));
}
SymbolList symL = SimpleSymbolList(DNATools.getDNA(), list);
// make a suffix tree
SuffixTree sTree = new SuffixTree(DNATools.getDNA());
// add symL to the suffix tree, capping it at depth 15
sTree.addSymbols(symL, 15);
// read out the frequency of AGG
SuffixTree.SuffixNode node = sTree.getRoot();
// navigate tree using sTree methods - this is safe if the
// sub-string wasn't in the original seq
node = sTree.getChild(node, DNATools.a());
node = sTree.getChild(node, DNATools.g());
node = sTree.getChild(node, DNATools.g());
// print out its frequency
System.out.println("AGG frequency: " + node.getNumber());