[Biojava-dev] Regexes and sequencing searching

Wed Apr 7 04:28:28 EDT 2004

I have a need for software that will do a humongous amount of motif 
searching on a large number of long sequences.  I have tested a large 
number of combinations and the conclusions may be of interest.

The trial case is 80 motifs searched on a smallish target of around > 1 Mb.
The time includes all sequence reading/conversions/regex compilations/etc.  
Time is CPU secs.

1) Sequence wrapped in SymbolListCharSequence, then searched with 
java.util.regex.	17 secs.
2) Sequence converted to SimpleSymbolList, wrapped in 
SymbolListCharSequence, then searched with java.util.regex. 14 secs.
3) Sequence converted to String then searched with java.util.regex.  7 secs.

I have also written my own SymbolList regex package (more than one!) that 
allows direct searching of sequence in any FiniteAlphabet.
4) Searching by recursive matching through a regex-like structure (an order 
of magnitude worse than anything else here!)
5) Compilation of RE -> NFA -> DFA -> ArrayStateMachine 25 secs.
A change in the sequence implementation does not affect my symbollist regex 
package as it uses various tricks to reduce slow symbol access.

Conclusions:
1) java.util.regex is bloody fast: do they have a native implementation? 
2) the compressed sequence representations we use automatically for large 
sequences can affect some kinds of analyses that are dependent on a huge 
symbol bandwidth.  Even the fastest, most memory intensive representation 
is slower than String for individual "symbol" access.  Evidently there is a 
tradeoff between being able to have the whole sequence in RAM and swapping 
and comparative sizes of our packed representation and a String to consider 
in any final decision.

Also, do we really want a direct SymbolList regex package?
Pros: 
Can search in any alphabet including those that cannot be handled with 
java.util.regex because of the absence of a suitable tokenization (large 
numbers of symbols).  Search alignments in protein cross-product space?
Cons: 
yet another package to maintain.
Slower than java.util.regex.

Regards,
David Huen