[Biojava-dev] new seq searching classes

Matthew Pocock matthew.pocock at ncl.ac.uk
Tue Sep 2 12:09:57 EDT 2003


Hi,

I've added a couple of classes in org.biojava.bio.search for finding 
regions of sequence content. They are SeqContentPattern and 
SeqContentMatcher - the API is loosely based upon KMPSearch and the Java 1.4 
regex libs. These classes aren't javadocked yet.

SeqContentPattern encapsulates the rules about what regions to select - 
the length, and the minimum and maximum number of occurrences for each 
nucleotide.

SeqContentMatcher is a cursor produced by scp.matcher(SymbolList) and 
can be used to find the next match, get the matching sub-sequence and to 
discover the offset of that match.

E.g. to find regions of length 10 with at least 8 As, no G or T and at 
most 2 Cs, you could do something like:

SeqContentPattern scp = new SeqContentPattern(DNATools.getDNA());
scp.setLength(10);
scp.setMinCounts(DNATools.a(), 8);
scp.setMaxCounts(DNATools.g(), 0);
scp.setMaxCounts(DNATools.c(), 2);
scp.setMaxCounts(DNATools.t(), 0);

Then to search with this you'd do something like:

SeqContentMatcher scm = scp.matcher(symList);

while(scm.find()) {
  System.out.println("Hit at: " + scm.pos());
}
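
For anyone wondering how a content search of this kind can work, here is a
minimal stand-alone sketch of a sliding-window scan over a plain String. It is
not the BioJava code and does not use SeqContentPattern or SeqContentMatcher;
the class name ContentWindowSearch and the methods findWindows and accepts are
made up for illustration. It assumes positions are reported 1-based, as
SymbolList indexes from 1, and that a symbol with no explicit limit is
unconstrained.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of BioJava.
class ContentWindowSearch {

    // Returns the 1-based start positions of every window of the given
    // length whose character counts satisfy the min and max limits.
    // Characters absent from a map are unconstrained on that side.
    static List<Integer> findWindows(String seq, int length,
                                     Map<Character, Integer> minCounts,
                                     Map<Character, Integer> maxCounts) {
        List<Integer> hits = new ArrayList<>();
        if (length <= 0 || seq.length() < length) {
            return hits;
        }
        // Count the characters in the first window.
        Map<Character, Integer> counts = new HashMap<>();
        for (int i = 0; i < length; i++) {
            counts.merge(seq.charAt(i), 1, Integer::sum);
        }
        for (int start = 0; ; start++) {
            if (accepts(counts, minCounts, maxCounts)) {
                hits.add(start + 1); // convert to 1-based
            }
            int next = start + length;
            if (next >= seq.length()) {
                break;
            }
            // Slide the window: drop the leftmost character, add the next one.
            counts.merge(seq.charAt(start), -1, Integer::sum);
            counts.merge(seq.charAt(next), 1, Integer::sum);
        }
        return hits;
    }

    // True if every min and max limit holds for the current window counts.
    static boolean accepts(Map<Character, Integer> counts,
                           Map<Character, Integer> minCounts,
                           Map<Character, Integer> maxCounts) {
        for (Map.Entry<Character, Integer> e : minCounts.entrySet()) {
            if (counts.getOrDefault(e.getKey(), 0) < e.getValue()) {
                return false;
            }
        }
        for (Map.Entry<Character, Integer> e : maxCounts.entrySet()) {
            if (counts.getOrDefault(e.getKey(), 0) > e.getValue()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Same rules as the example above: length 10, at least 8 As,
        // no G or T, at most 2 Cs.
        Map<Character, Integer> min = new HashMap<>();
        min.put('A', 8);
        Map<Character, Integer> max = new HashMap<>();
        max.put('G', 0);
        max.put('C', 2);
        max.put('T', 0);
        for (int pos : findWindows("GAAAAAAAAAAC", 10, min, max)) {
            System.out.println("Hit at: " + pos);
        }
    }
}
```

Updating the counts incrementally means each step costs two map updates plus
one check per constrained symbol, rather than recounting the whole window.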

Does anybody think this is useful?

Matthew


