[Biojava-l] Anyon has ever search for CpG Islands with BioJava?

Stein Aerts stein.aerts at esat.kuleuven.ac.be
Fri Mar 28 10:29:33 EST 2003

Hi Sylvain,

This piece of code should do it. The nrCGRandom changes with the 
windowlength though (I calculated this value from EPD a long time ago). 
I use org.apache.oro.text.regex.*;



public void annotateCpG(Sequence seq) throws Exception{
        double nrCGRandom = 14.1048d;
        int windowLength = 200;
        int step=100;
        String motif = "cg";
        PatternMatcher matcher=new Perl5Matcher();
        PatternCompiler compiler = new Perl5Compiler();
        Pattern pattern = null;
        PatternMatcherInput input;
        MatchResult result;
        pattern = compiler.compile(motif);
        int nrOfOcc=0;
        int nrConsec=0;
        double gcContent = 0d;
        double cgValue=0d;
        int index = 1;
        String dnaStr;
        SymbolList window= null;
        int i=1;
        while (index<seq.length()-windowLength){
          window = seq.subList(index,index+windowLength);
          dnaStr = window.seqString();
          input   = new PatternMatcherInput(dnaStr);
          while(matcher.contains(input, pattern)) {
          gcContent = GCContent(window);
          cgValue = (nrCgRandom/nrOfOcc);
          if (cgValue<0.6d && gcContent>50.0d){
          else if (nrConsec>0){
            int start = (index-(nrConsec)*step);
            int stop = index;
            Feature.Template template = new Feature.Template();
            template.type = "misc_feature";
            template.location = new RangeLocation(start,stop);
            template.annotation = new SimpleAnnotation();
            template.annotation.setProperty("type","CpG Island");

 public double GCContent(SymbolList seq){
    int gc = 0;
    for (int pos = 1; pos <= seq.length(); ++pos) {
        org.biojava.bio.symbol.Symbol sym = seq.symbolAt(pos);
        if (sym == org.biojava.bio.seq.DNATools.g() || sym == 
    double content = (gc*100)/seq.length();
    return content;

Sylvain Foisy wrote:

>Sorry for the hand holding question. I am trying to wrap the output of a
>Perl script that finds CpG islands (cpgi130.pl) using Runtime.exec with
>little success. Has anyone ever built a class that can find CpG islands in
>Sequence objects and build features with RangeLocations?
>Other solution: has anyone ever try to write a wrapper that would use the
>output of any program of the EMBOSS suite? It has some CpG island finding
>Anyhelp would be much appreciated
>P.S.: there has been a call for an algorithm repository on Usenet. Would it
>be possible/interesting to create such a repository for classes built with
>BioJava that could be of interest to other rookies like me.
>Just an idea...
>Sylvain Foisy, Ph. D.
>Directeur - operations / Project Manager
>BioneQ - Reseau quebecois de bio-informatique
>U. de Montreal / Genome-Quebec
>Addresse postale:
>Departement de biochimie
>Pavillon principal
>2900, boul. Édouard-Montpetit
>Montréal (Québec) H3T 1J4
>Tel: (514) 343-6111 x.4066
>Courriel: sylvain.foisy at bioneq.qc.ca
>Biojava-l mailing list  -  Biojava-l at biojava.org

Stein Aerts BioI at SISTA
K.U.Leuven ESAT-SCD Belgium

More information about the Biojava-l mailing list