[Biojava-dev] Regexes on symbol lists - further thoughts

David Huen smh1008 at cus.cam.ac.uk
Mon Apr 12 22:58:52 EDT 2004


I have been considering further the problem of how to permit regexes to be
performed on SymbolLists on arbitrary alphabets.

I list below the options that we have available to us and the pros/cons:-
1) the MotifTools package.  This uses java.util.regex to search a
SymbolList by wrapping it in a SymbolListCharSequence.  It makes it quite
easy to search DNA/RNA/Protein sequences with regexes in that there are
single letter SymbolTokenizations available for all these alphabets.  The
regex can then be easily written as a text string.  The drawback is with
arbitrary alphabets (e.g. DNAxDNA or protein x protein), there is not such
a SymbolTokenization.  Basically, we require a single character
SymbolTokenization for every alphabet.

2) my provisional org.biojava.utils.automata code.  This uses DFAs to
implement a fast way of running regexes and I have now got to the stage
where is it comparable/faster than java.util.regex when used on
SymbolLists in the manner described in (1).  The drawback with DFAs is
that I don't think it is possible to implement the greedy/non-greedy
quantifiers to work exactly as they work in Perl.  It is easy to have
greedy or non-greedy behaviour throughout the regex but not when they are
mixed.  In the latter case, some cases can be made to behave in the
expected manner but not all.  Alternating greedy/non-greedy blocks appear
to be a showstopper. It is possible to use NFAs to do regexes but NFAs are
slower and require backtracking.  However this package will search in any
arbitrary alphabet.

It strikes me that both are incomplete approaches even though they will in
their own way suit 90% of the use cases.

It has struck me that there is one other option open to me.  I could use
the same strategy as (1) except I have a SymbolList -> CharSequence
conversion that does NOT use our default SymbolTokenizers.

Instead, I use the alphabet index to convert the symbol to an equivalent
integer and then convert that integer into a code point in the Unicode
private space (E000-F8FF in UTF-16 which is the Java char space).  The
current search strings in ASCII can be converted to the equivalent in this
alphabet index space too.  Now I should be able to search with regexes in
any alphabet with less than 6399 symbols.  This should be fine in all
cases where we wish to know the location on the sequence where
particular motifs lie.

Does anyone have views/objections they might wish to offer as to whether
this is too ugly to live?

Regards,
David Huen



More information about the biojava-dev mailing list