[Biojava-l] Regexp matching performance

Michael B. Allen mballen@erols.com
Wed, 16 May 2001 18:14:24 -0400


On Wed, May 16, 2001 at 03:47:01PM -0400, Sharon L. Cousins wrote:
> 
> I have rewritten our Python implementation of sequence record parsers and 
> filters (for both GenBank and EMBL formats) in Java and have narrowed down poor 
> performance problems to regular expression matching.

If you're looking for performance, parsers really should not use regular
expressions at all. Even when using the String class, beware that when
you do something like str.substring( x ), you're creating a new object,
and so on. (If you *really* want performance, read from a
BufferedInputStream character by character and do real C-style
parsing.) But for line-oriented formats like the ones you're parsing,
BufferedReader.readLine is the most practical solution. Still, using
regular expressions in a parser is not a great idea.
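
As a rough sketch of the readLine approach (not a real Genbank or EMBL
parser): the class and method names below are made up for illustration,
and the layout it assumes, a two-letter line code with the data starting
at column 6 and "//" ending the record, is my assumption about EMBL-style
lines. It uses only startsWith and substring, no regular expressions.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class LineCodeScanner {

    /**
     * Collects the data part of every line in the first record that starts
     * with the given two-letter line code. Hypothetical helper, for
     * illustration only.
     */
    public static List<String> linesWithCode(Reader in, String code) throws IOException {
        List<String> out = new ArrayList<>();
        BufferedReader reader = new BufferedReader(in);
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.startsWith("//")) {
                break; // assumed end-of-record marker
            }
            if (line.startsWith(code)) {
                // Assumed layout: data begins at column 6 (index 5).
                // Note that substring and trim each create a new String.
                out.add(line.length() > 5 ? line.substring(5).trim() : "");
            }
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        // Made-up record, not real database content.
        String record = "ID   X00001; SV 1; linear; mRNA\n"
                      + "AC   X00001;\n"
                      + "DE   Hypothetical example record\n"
                      + "//\n";
        System.out.println(linesWithCode(new StringReader(record), "AC"));
    }
}
```

This still allocates a String per line and per extracted field, so it is
the practical middle ground rather than the fastest option.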

Now if you're searching for an expression in the sequence data, that's an
entirely different problem, where performance will differ dramatically
depending on the expression and the size of the dataset. But I suspect you
know that and that's not what you're doing.

Mike

-- 
signature pending