[Biojava-l] Behavior of the createRegex() method (MotifTool class)

Keith James kdj@sanger.ac.uk
01 Dec 2002 18:42:00 +0000


>>>>> "Sylvain" == Sylvain Foisy <sylvain.foisy@bioneq.qc.ca> writes:

    Sylvain> Hi, I used the createRegex() method to build a regular
    Sylvain> expression from a DNA sequence entered by the user, in
    Sylvain> order to scan a genome for that motif. I just discovered an
    Sylvain> interesting thing about that method: if n is in the motif
    Sylvain> to seek, the regex will not have n as a possibility.

    Sylvain> Ok, I have that motif: atgnnnngta.

    Sylvain> createRegex() would return: atg[atcg]{4}gta and it does.

    Sylvain> What if my sequence to scan contains n: atgagcngta, for
    Sylvain> example.  java.util.regex would not find the
    Sylvain> pattern. Unless mistaken, the pattern should be
    Sylvain> atg[atcgn]{4}gta.

    Sylvain> Am I wrong? Any input would be appreciated

You are correct about the behaviour, but not about the solution. An
ambiguous target sequence could contain n, but it could also contain
r, y, m, k, s, w, h, b, v and d, none of which [atcgn] would match. To
match correctly, the regex would have to take into account that the
symbols represented by n are a superset of those represented by each
of the other ambiguity symbols.
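
To make the point concrete, here is a minimal sketch using only
java.util.regex (no BioJava calls). The class name, the constants and
the second target string are made up for illustration, and the
widened character class is just one way the n positions could be
handled, not what createRegex() produces today.

```java
import java.util.regex.Pattern;

public class AmbiguousTargetDemo {
    // The regex quoted above as the current createRegex() result.
    static final String CURRENT = "atg[atcg]{4}gta";
    // The proposed fix: additionally allow a literal n.
    static final String WITH_N = "atg[atcgn]{4}gta";
    // Hypothetical widening: an n in the motif accepts any IUPAC
    // nucleotide code in the target, since n covers all of them.
    static final String WITH_IUPAC = "atg[acgtrymkswhbvdn]{4}gta";

    // True if the regex occurs anywhere in the target sequence.
    static boolean found(String regex, String target) {
        return Pattern.compile(regex).matcher(target).find();
    }

    public static void main(String[] args) {
        // The second target is an invented example containing r.
        String[] targets = {"atgagcngta", "atgagcrgta"};
        for (String target : targets) {
            System.out.println(target
                + " current=" + found(CURRENT, target)
                + " withN=" + found(WITH_N, target)
                + " withIupac=" + found(WITH_IUPAC, target));
        }
    }
}
```

Adding n alone rescues the first target but not the second. Note that
even the widened pattern still requires a literal atg and gta, so
ambiguity symbols at those positions in the target would be missed as
well.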

As MotifTools is generic (it will work for any alphabet), implementing
generation of regexes for searching ambiguous SymbolLists requires a
more complex algorithm than the current one. I'll take a look at this
as soon as I can.
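
As a rough illustration of what such an algorithm might look like
for DNA only, the sketch below expands each motif symbol into a
character class of every IUPAC code that shares at least one base
with it. This is an assumption-laden sketch, not the MotifTools
implementation: it uses a hard-coded table instead of BioJava
alphabets, the names AmbiguityRegexSketch, overlaps and toRegex are
invented, and "shares at least one base" is only one possible
definition of a match (requiring the target symbol to be a subset of
the motif symbol would be stricter).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class AmbiguityRegexSketch {
    // IUPAC nucleotide codes mapped to the bases each one stands for.
    static final Map<Character, String> IUPAC = new LinkedHashMap<>();
    static {
        IUPAC.put('a', "a");   IUPAC.put('c', "c");
        IUPAC.put('g', "g");   IUPAC.put('t', "t");
        IUPAC.put('r', "ag");  IUPAC.put('y', "ct");
        IUPAC.put('m', "ac");  IUPAC.put('k', "gt");
        IUPAC.put('s', "cg");  IUPAC.put('w', "at");
        IUPAC.put('h', "act"); IUPAC.put('b', "cgt");
        IUPAC.put('v', "acg"); IUPAC.put('d', "agt");
        IUPAC.put('n', "acgt");
    }

    // True if the two base sets have at least one base in common.
    static boolean overlaps(String x, String y) {
        for (char c : x.toCharArray()) {
            if (y.indexOf(c) >= 0) {
                return true;
            }
        }
        return false;
    }

    // Builds one character class per motif symbol, listing every code
    // whose base set overlaps that symbol's base set. Runs of equal
    // classes are not collapsed into {n} quantifiers here.
    static String toRegex(String motif) {
        StringBuilder regex = new StringBuilder();
        for (char symbol : motif.toCharArray()) {
            String bases = IUPAC.get(symbol);
            if (bases == null) {
                throw new IllegalArgumentException("Unknown symbol: " + symbol);
            }
            regex.append('[');
            for (Map.Entry<Character, String> entry : IUPAC.entrySet()) {
                if (overlaps(bases, entry.getValue())) {
                    regex.append(entry.getKey());
                }
            }
            regex.append(']');
        }
        return regex.toString();
    }

    public static void main(String[] args) {
        String regex = toRegex("atgnnnngta");
        System.out.println(regex);
        System.out.println(Pattern.compile(regex).matcher("atgagcngta").find());
    }
}
```

With this table an a in the motif becomes [armwhvdn] and an n becomes
a class of all fifteen codes, so the motif from the original question
now matches atgagcngta. The cost is a much more permissive pattern, so
whether overlap is the right rule depends on how ambiguous hits are
meant to be reported.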

Keith

-- 

- Keith James <kdj@sanger.ac.uk> bioinformatics programming support -
- Pathogen Sequencing Unit, The Wellcome Trust Sanger Institute, UK -