[Biojava-l] regex performance in Java

P. Troshin to.petr at gmail.com
Wed Oct 24 18:30:16 UTC 2012


Hi Hilmar,

I looked at the test in a bit more detail. I can see what you are
trying to test, but is there a real-life problem behind this?
What this test does is run a lot of searches on very short strings. Is
this what your real-life application does? I am asking because if your
real-life application uses regexps to search long strings, the
performance might be totally different.
What is your aim? 3 seconds for 500K searches does not seem
particularly slow to me.
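
For many searches on very short strings, one thing that may be worth
checking is whether the regex is compiled once or on every call. The
sketch below is only an illustration under assumptions: the pattern,
the input strings and the class and method names are all made up, and
I am not claiming the benchmark in the gist recompiles its patterns.
It compares String.matches, which compiles the expression on each
call, with a Pattern compiled once and a Matcher reused via reset().

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: names, pattern and inputs are illustrative only.
class RegexReuseSketch {

    // Made-up pattern, not taken from the benchmark scripts.
    static final String REGEX = "^([A-Za-z]+)_(\\d+)$";

    // String.matches compiles the regex again on every call.
    static int countRecompiling(String[] inputs) {
        int hits = 0;
        for (String s : inputs) {
            if (s.matches(REGEX)) {
                hits++;
            }
        }
        return hits;
    }

    // Compile once, then reuse a single Matcher for every input.
    static int countPrecompiled(String[] inputs) {
        Pattern pattern = Pattern.compile(REGEX);
        Matcher matcher = pattern.matcher("");
        int hits = 0;
        for (String s : inputs) {
            if (matcher.reset(s).matches()) {
                hits++;
            }
        }
        return hits;
    }

    // Builds short test strings; every second one matches REGEX.
    static String[] makeInputs(int n) {
        String[] inputs = new String[n];
        for (int i = 0; i < n; i++) {
            inputs[i] = (i % 2 == 0) ? "gene_" + i : "no match " + i;
        }
        return inputs;
    }

    public static void main(String[] args) {
        String[] inputs = makeInputs(500_000);

        // Warm-up rounds so the JIT has a chance to compile the hot loops.
        for (int i = 0; i < 5; i++) {
            countRecompiling(inputs);
            countPrecompiled(inputs);
        }

        long t0 = System.nanoTime();
        int a = countRecompiling(inputs);
        long t1 = System.nanoTime();
        int b = countPrecompiled(inputs);
        long t2 = System.nanoTime();

        System.out.printf("recompiling: %d hits in %d ms%n", a, (t1 - t0) / 1_000_000);
        System.out.printf("precompiled: %d hits in %d ms%n", b, (t2 - t1) / 1_000_000);
    }
}
```

Timings from a loop like this depend heavily on the JVM and on
warm-up, so the numbers are only a rough indication. Both methods
should always report the same hit count.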

Thanks
Peter


On 24 October 2012 19:10, P. Troshin <to.petr at gmail.com> wrote:
> Hi Hilmar,
>
> Hmm, it looks like I spoke too soon; the previous run was doing
> nothing as all of the cases were commented out.
> I can now see that the results of my runs are not massively different
> from yours.
> It would help if you could encourage your student to write a few unit
> tests so that we know what you are trying to achieve and to simplify
> the testing.
>
> Just a thought
>
> Thanks,
> Peter
>
>
>
> On 24 October 2012 17:47, Hilmar Lapp <hlapp at drycafe.net> wrote:
>> Hi everyone,
>>
>> Thanks for all your responses. Indeed I know that the Java regex API isn't an enjoyable one to program with, and if the underlying task were about writing something from scratch, I'd be all for avoiding regexes too if the same thing could be achieved by string comparison.
>>
>> However, and of course I failed to say that initially, the task from which this query originates is about converting a Perl script to Java (not because Perl is somehow bad, but because those Perl scripts have proven to be an obstacle to easy cross-platform installation of the - mostly Java - software they are a part of). That doesn't mean one couldn't, in the process, also rewrite the code that uses regular expressions into code that doesn't, but I also think it wise not to introduce multiple variables as a source of error at once.
>>
>> Some of the responses would be best answered by looking at the expressions and the code that uses them, so here are the two "benchmark" scripts.
>>
>> Java: https://gist.github.com/3940931
>> Perl: https://gist.github.com/3940780
>>
>> I'm also copying Dongye Meng here, who is a CS student at UNC working with us on the project - if anyone has further wisdom to share about how to reduce the performance gap between the two versions, he'd surely appreciate it.
>>
>>         -hilmar
>>
>> On Oct 23, 2012, at 6:42 AM, Phillip Lord wrote:
>>
>>> Hilmar Lapp <hlapp at drycafe.net> writes:
>>>> They (at least as in java.util.regex) have been reported to me as
>>>> performing much slower (by several orders of magnitude) than the regex
>>>> implementation in Perl, and some simple benchmarking tests seem to
>>>> bear that out. Even after scrutinizing the benchmark and finding
>>>> nothing obvious, I'm still skeptical as to why this would be the case
>>>> - naively I would have assumed that the underlying runtime library is
>>>> implemented in C in both cases. But perhaps this is not true?
>>>
>>>
>>> Well, the difference is that Perl is Perl, while Java is not; it all
>>> depends on the JVM, and on the libraries as well. A quick shuftie at
>>> the source for the open-jdk libraries suggests that the regexp searching
>>> is done in Java -- it's not just a drop through to C. Always the problem
>>> with performance optimisation on Java -- you are only optimising for one
>>> situation. It might be interesting to see how much variation there is
>>> between JVMs.
>>>
>>> Like others, I would only use regexp as a last resort in Java anyway;
>>> compared to Perl, writing the code is painful. Still, I guess that you
>>> know this!
>>>
>>> Phil
>>
>> --
>> ===========================================================
>> : Hilmar Lapp -:- Durham, NC -:- hlapp at drycafe dot net :
>> ===========================================================
>>
>>
>>
>>
>>
>> _______________________________________________
>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-l



