[Biojava-l] regex performance in Java
hlapp at drycafe.net
Thu Oct 25 03:45:52 UTC 2012
The code is a very small snippet from a piece of natural language processing software aimed at extracting structured phenotype descriptions from un- or semistructured free text. The code as is (in Perl) apparently performs a lot of regular expression matches, so if the speed difference for them between Perl and Java is significant, this could in theory become a problem. Whether it will actually amount to a bottleneck remains to be seen, though, as the code is also doing other things that are potentially expensive, possibly more so than the regex matching.
So the exercise here is merely to see whether there is a notable performance difference in regex pattern evaluation that can't simply be attributed to programming mistakes (and apparently there is).
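For readers wondering what "programming mistakes" would look like in a Java regex micro-benchmark, here is a minimal sketch of two common ones: recompiling the pattern on every search, and timing without a JIT warm-up. It is not the code from the gists quoted below; the pattern, the input strings, and all class and method names are made up for illustration, and the round count is chosen only to land on roughly the 500K searches mentioned in the thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexBenchSketch {

    // Hypothetical pattern and inputs (assumptions, not taken from the gists).
    static final String REGEX = "\\b(\\d+)\\s*(mm|cm)\\b";
    static final String[] INPUTS = {
        "leaves 3 mm wide", "petals absent", "stem 12 cm long", "fruit a capsule"
    };

    // Variant with the mistake: compiles the pattern again for every search.
    static int countRecompiling(String regex, String[] inputs, int rounds) {
        int hits = 0;
        for (int r = 0; r < rounds; r++) {
            for (String s : inputs) {
                if (Pattern.compile(regex).matcher(s).find()) {
                    hits++;
                }
            }
        }
        return hits;
    }

    // Variant without it: compile once, reuse a single Matcher via reset().
    static int countPrecompiled(Pattern pattern, String[] inputs, int rounds) {
        int hits = 0;
        Matcher m = pattern.matcher("");
        for (int r = 0; r < rounds; r++) {
            for (String s : inputs) {
                if (m.reset(s).find()) {
                    hits++;
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Pattern compiled = Pattern.compile(REGEX);
        int rounds = 125_000; // 4 inputs x 125,000 rounds = 500K searches

        // Untimed warm-up so the JIT compiler has a chance to optimize both loops.
        countRecompiling(REGEX, INPUTS, 10_000);
        countPrecompiled(compiled, INPUTS, 10_000);

        long t0 = System.nanoTime();
        int a = countRecompiling(REGEX, INPUTS, rounds);
        long t1 = System.nanoTime();
        int b = countPrecompiled(compiled, INPUTS, rounds);
        long t2 = System.nanoTime();

        System.out.printf("recompiling: %d hits, %d ms%n", a, (t1 - t0) / 1_000_000);
        System.out.printf("precompiled: %d hits, %d ms%n", b, (t2 - t1) / 1_000_000);
    }
}
```

Both variants should report the same hit count; only the timings are of interest, and those will vary with the JVM and the hardware. If a gap to Perl remains with the pattern precompiled and the loops warmed up, it is less likely to be an artifact of how the benchmark was written.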
On Oct 24, 2012, at 2:30 PM, P. Troshin wrote:
> Hi Hilmar,
> I looked at the test in a bit more detail. I can see what you are
> trying to test, but is there a real-life problem behind this?
> What this test is doing is a lot of searches on very short strings. Is
> this what your real life application does? I am asking because if your
> real life application uses regexp to look into long strings, the
> performance might be totally different.
> What is your aim? 3 seconds for 500K searches does not seem
> particularly slow to me.
> On 24 October 2012 19:10, P. Troshin <to.petr at gmail.com> wrote:
>> Hi Hilmar,
>> Hmm, it looks like I spoke too soon; the previous run was doing
>> nothing as all of the cases were commented out.
>> I can now see that the results of my runs are not massively different
>> from yours.
>> It would help if you could encourage your student to write a few unit
>> tests so that we know what you are trying to achieve and to simplify
>> the testing.
>> Just a thought
>> On 24 October 2012 17:47, Hilmar Lapp <hlapp at drycafe.net> wrote:
>>> Hi everyone,
>>> Thanks for all your responses. Indeed I know that the Java regex API isn't an enjoyable one to program with, and if the underlying task were about writing something from scratch, I'd be all for avoiding regexes too if the same thing could be achieved by string comparison.
>>> However, and of course I failed to say that initially, the task from which this query originates is about converting a Perl script to Java (not because Perl is somehow bad, but because those Perl scripts have proven to be an obstacle to easy cross-platform installation of the - mostly Java - software they are a part of). That doesn't mean one couldn't, in the course of this, also rewrite the code that uses regular expressions into code that doesn't, but I also think it wise not to introduce multiple variables as a source of error at once.
>>> Some of the responses would be best answered by looking at the expressions and the code that uses them, so here are the two "benchmark" scripts.
>>> Java: https://gist.github.com/3940931
>>> Perl: https://gist.github.com/3940780
>>> I'm also copying Dongye Meng here, who is a CS student at UNC working with us on the project - if anyone has further wisdom to share about how to reduce the performance gap between the two versions, he'd surely appreciate it.
>>> On Oct 23, 2012, at 6:42 AM, Phillip Lord wrote:
>>>> Hilmar Lapp <hlapp at drycafe.net> writes:
>>>>> They (at least as in java.util.regex) have been reported to me as
>>>>> performing much slower (by several orders of magnitude) than the regex
>>>>> implementation in Perl, and some simple benchmarking tests seem to
>>>>> bear that out. Even after scrutinizing the benchmark and finding
>>>>> nothing obvious, I'm still skeptical as to why this would be the case
>>>>> - naively I would have assumed that the underlying runtime library is
>>>>> implemented in C in both cases. But perhaps this is not true?
>>>> Well, the difference is that Perl is perl, while Java is not; it all
>>>> depends on the JVM, and libraries also. A quick shuftie at
>>>> the source for the open-jdk libraries suggests that the regexp searching
>>>> is done in Java -- it's not just a drop through to C. Always the problem
>>>> with performance optimisation on Java -- you are only optimising for one
>>>> situation. It might be interesting to see how much variation there is
>>>> between JVMs.
>>>> Like others, I would only use regexp as a last resort in Java anyway;
>>>> compared to Perl, writing the code is painful. Still, I guess that you
>>>> know this!
>>> : Hilmar Lapp -:- Durham, NC -:- hlapp at drycafe dot net :
: Hilmar Lapp -:- Durham, NC -:- hlapp at drycafe dot net :