[Biojava-dev] Regexes and sequencing searching
David Huen
david.huen at ntlworld.com
Wed Apr 7 04:28:28 EDT 2004
I have a need for software that will do a humongous amount of motif
searching on a large number of long sequences. I have tested a large
number of combinations and the conclusions may be of interest.
The trial case is 80 motifs searched on a smallish target of around > 1 Mb.
The time includes all sequence reading/conversions/regex compilations/etc.
Time is CPU secs.
1) Sequence wrapped in SymbolListCharSequence, then searched with
java.util.regex. 17 secs.
2) Sequence converted to SimpleSymbolList, wrapped in
SymbolListCharSequence, then searched with java.util.regex. 14 secs.
3) Sequence converted to String then searched with java.util.regex. 7 secs.
I have also written my own SymbolList regex package (more than one!) that
allows direct searching of sequence in any FiniteAlphabet.
4) Searching by recursive matching through a regex-like structure (an order
of magnitude worse than anything else here!)
5) Compilation of RE -> NFA -> DFA -> ArrayStateMachine 25 secs.
A change in the sequence implementation does not affect my symbollist regex
package as it uses various tricks to reduce slow symbol access.
Conclusions:
1) java.util.regex is bloody fast: do they have a native implementation?
2) the compressed sequence representations we use automatically for large
sequences can affect some kinds of analyses that are dependent on a huge
symbol bandwidth. Even the fastest, most memory intensive representation
is slower than String for individual "symbol" access. Evidently there is a
tradeoff between being able to have the whole sequence in RAM and swapping
and comparative sizes of our packed representation and a String to consider
in any final decision.
Also, do we really want a direct SymbolList regex package?
Pros:
Can search in any alphabet including those that cannot be handled with
java.util.regex because of the absence of a suitable tokenization (large
numbers of symbols). Search alignments in protein cross-product space?
Cons:
yet another package to maintain.
Slower than java.util.regex.
Regards,
David Huen
More information about the biojava-dev
mailing list