[Biojava-dev] [Biojava-l] regex performance in Java
P. Troshin
to.petr at gmail.com
Thu Oct 25 13:10:56 UTC 2012
Hi Hilmar,
I feel that we do not have enough information to help you with your
performance issue.
We need to know how many regexp searches you are making in a single
transaction (user request-response) and the acceptable response time.
Otherwise, we might be doing premature optimization. We cannot tell
whether 500K matches in 3 seconds is fast enough or not.
In the test you are only testing the search performance; however, I am
sure that in the real system, you must be doing something with your
findings. So to make a real comparison, we need to see the test that
incorporates all the common operations that your real life systems
perform with these regexps.
When rewriting something in Java you have to bear in mind that just
translating Perl to Java is not enough, as these are two very
different languages. The same applies to rewriting Java in Perl. You
are unlikely to get decent performance or readable code just from
translating the structure of one program into another. You should give
some consideration to the design of your application, however small it
might be. Each language has its own specifics.
Bear in mind that object construction in Java is more expensive than
function calls in Perl. Also regexp performance could vary enormously
depending on the size of the input string. If you are searching in
article abstracts then you need to test on the collection of these
abstracts, not on parts of them. The test data seems to be coming from
an abstract, but it is split into different strings. I think if you
merge all these strings into one, you would get better performance.
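A minimal sketch of the merging idea above, under stated assumptions: the
class name, the pattern and the sample fragments are hypothetical and are
not taken from the benchmark gist. The fragments are joined with a newline
so that this particular pattern cannot match across a fragment boundary.
Whether the merged search is actually faster is something the benchmark
would have to show.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MergedInputSketch {
    // Hypothetical pattern: runs of digits. It can never match a newline,
    // so joining fragments with "\n" does not create matches that span
    // two fragments.
    private static final Pattern NUMBER = Pattern.compile("\\d+");

    // Joins all fragments into one string and scans it with one Matcher.
    static int countInMerged(List<String> fragments) {
        String merged = String.join("\n", fragments);
        Matcher m = NUMBER.matcher(merged);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> fragments = Arrays.asList("leaves 3 to 5 cm", "petals 12");
        System.out.println(countInMerged(fragments));
    }
}
```

Note that merging is only safe when the pattern cannot match across the
separator; anchored patterns such as ^...$ would need the MULTILINE flag.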
Remember that Java Strings are encoded in UTF-16 by default; in this
encoding a single character takes 2 bytes (4 for characters outside the
Basic Multilingual Plane). If you are only interested in the ASCII
charset, then you should tell Java about it. Check out the various
String constructors and use the appropriate one. Similarly, if you are
loading data from the file system, make sure you set the appropriate
charset on the Reader. I am not sure whether this will give you a
performance increase, but it may reduce the program’s memory
footprint.
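A minimal sketch of setting the charset explicitly, assuming the input
really is plain ASCII. The class and method names are hypothetical. The
charset only controls how bytes are decoded into a String; any effect on
speed or memory would need to be measured on your JVM.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class CharsetSketch {
    // Decodes raw bytes with an explicit charset instead of the platform default.
    static String decode(byte[] raw) {
        return new String(raw, StandardCharsets.US_ASCII);
    }

    // Reads the first line of a stream through a Reader with an explicit charset.
    static String readFirstLine(InputStream in) {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.US_ASCII))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] raw = "leaf blade ovate\nsecond line".getBytes(StandardCharsets.US_ASCII);
        System.out.println(decode(raw));
        System.out.println(readFirstLine(new ByteArrayInputStream(raw)));
    }
}
```

For a file, the same InputStreamReader can wrap a FileInputStream in place
of the in-memory stream used here.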
A small point - is it really necessary to construct matchers within
the test loop? Not in your test, I think; you can take them out.
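A minimal sketch of taking the matcher out of the loop, assuming a single
fixed pattern applied to many inputs. The class name, pattern and sample
inputs are hypothetical and not taken from the benchmark gist. The Pattern
is compiled once, and one Matcher is rebound to each input with
Matcher.reset(CharSequence).

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReusedMatcherSketch {
    // Compiled once; Pattern instances are immutable and can be shared.
    private static final Pattern WORD_ENDING_IN_S = Pattern.compile("\\b\\w+s\\b");

    static int countMatches(List<String> inputs) {
        // One Matcher for the whole loop, initially bound to an empty input.
        Matcher m = WORD_ENDING_IN_S.matcher("");
        int count = 0;
        for (String input : inputs) {
            m.reset(input); // rebind to the new input instead of allocating a new Matcher
            while (m.find()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> inputs = Arrays.asList("leaves opposite", "petals five", "stem erect");
        System.out.println(countMatches(inputs));
    }
}
```

A Matcher is not thread-safe, so a reused instance should stay confined to
one thread.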
I hope that will be of some help.
Regards,
Peter
On 25 October 2012 04:59, Hilmar Lapp <hlapp at drycafe.net> wrote:
> We're only at the very beginning of this. I really appreciate all the feedback, but honestly all that the code examples justify at this point is asking whether the performance differences are reproducible (and it seems they are), and whether or not they are attributable to some stupid mistakes in programming the java.util.regex API (and it seems that they aren't).
>
> Dan, yes I know UIMA. Again, way down the road of this.
>
> -hilmar
>
> On Oct 24, 2012, at 11:50 PM, Mark Fortner wrote:
>
>> Have you tried profiling the code to see where it's spending most of its time?
>>
>> Mark
>>
>> On Oct 24, 2012 8:47 PM, "Hilmar Lapp" <hlapp at drycafe.net> wrote:
>> The code is a very small snippet from a natural language processing software aimed at extracting structured phenotype descriptions from un- or semistructured free text. Apparently the code as is (in Perl) makes a lot of regular expression matches, and so if the speed difference for them between Perl and Java is significant, in theory this might become a problem. Though whether it will or will not amount to a bottleneck indeed remains to be seen, as the code is also doing other things that are potentially expensive, and possibly more so than the regex matching.
>>
>> So the exercise here is merely to see whether there is a notable performance difference in regex pattern evaluation that can't simply be attributed to programming mistakes (and apparently there is).
>>
>> -hilmar
>>
>> On Oct 24, 2012, at 2:30 PM, P. Troshin wrote:
>>
>> > Hi Hilmar,
>> >
>> > I looked at the test in a bit more detail. I can see what you are
>> > trying to test, but is there a real life problem behind this?
>> > What this test is doing is a lot of searches on very short strings. Is
>> > this what your real life application does? I am asking because if your
>> > real life application uses regexps to look into long strings, the
>> > performance might be totally different.
>> > What is your aim? 3 seconds for 500K searches does not seem
>> > particularly slow to me.
>> >
>> > Thanks
>> > Peter
>> >
>> >
>> > On 24 October 2012 19:10, P. Troshin <to.petr at gmail.com> wrote:
>> >> Hi Hilmar,
>> >>
>> >> Hmm, it looks like I spoke too soon; the previous run was doing
>> >> nothing as all of the cases were commented out.
>> >> I can now see that the results of my runs are not massively different
>> >> from yours.
>> >> It would help if you could encourage your student to write a few unit
>> >> tests so that we know what you are trying to achieve and to simplify
>> >> the testing.
>> >>
>> >> Just a thought
>> >>
>> >> Thanks,
>> >> Peter
>> >>
>> >>
>> >>
>> >> On 24 October 2012 17:47, Hilmar Lapp <hlapp at drycafe.net> wrote:
>> >>> Hi everyone,
>> >>>
>> >>> Thanks for all your responses. Indeed I know that the Java regex API isn't an enjoyable one to program with, and if the underlying task were about writing something from scratch, I'd be all for avoiding regex's too if the same thing could be achieved by string comparison.
>> >>>
>> >>> However, and of course I failed to say that initially, the task from which this query is originating is about converting a Perl script to Java (not because Perl is somehow bad, but because those Perl scripts have shown to be an obstacle to easy cross-platform installation of the - mostly Java - software they are a part of). That doesn't mean one couldn't in the course also rewrite the code that uses regular expressions to one that doesn't, but I also think it wise not to introduce multiple variables as a source of error at once.
>> >>>
>> >>> Some of the responses would be best answered by looking at the expressions and the code that uses them, so here are the two "benchmark" scripts.
>> >>>
>> >>> Java: https://gist.github.com/3940931
>> >>> Perl: https://gist.github.com/3940780
>> >>>
>> >>> I'm also copying Dongye Meng here, who is a CS student at UNC working with us on the project - if anyone has further wisdom to share about how to reduce the performance gap between the two versions, he'd surely appreciate.
>> >>>
>> >>> -hilmar
>> >>>
>> >>> On Oct 23, 2012, at 6:42 AM, Phillip Lord wrote:
>> >>>
>> >>>> Hilmar Lapp <hlapp at drycafe.net> writes:
>> >>>>> They (at least as in java.util.regex) have been reported to me as
>> >>>>> performing much slower (by several orders of magnitude) than the regex
>> >>>>> implementation in Perl, and some simple benchmarking tests seem to
>> >>>>> bear that out. Even after scrutinizing the benchmark and finding
>> >>>>> nothing obvious, I'm still skeptical as to why this would be the case
>> >>>>> - naively I would have assumed that the underlying runtime library is
>> >>>>> implemented in C in both cases. But perhaps this is not true?
>> >>>>
>> >>>>
>> >>>> Well, the difference is that Perl is perl, while Java is not; it all
>> >>>> depends on the JVM, and libraries also. A quick shuftie at
>> >>>> the source for the open-jdk libraries suggests that the regexp searching
>> >>>> is done in Java -- it's not just a drop through to C. Always the problem
>> >>>> with performance optimisation on Java -- you are only optimising for one
>> >>>> situation. It might be interesting to see how much variation there is
>> >>>> between JVMs.
>> >>>>
>> >>>> Like others, I would only use regexp as a last resort in Java anyway;
>> >>>> compared to Perl, writing the code is painful. Still, I guess that you
>> >>>> know this!
>> >>>>
>> >>>> Phil
>> >>>
>> >>> --
>> >>> ===========================================================
>> >>> : Hilmar Lapp -:- Durham, NC -:- hlapp at drycafe dot net :
>> >>> ===========================================================
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> _______________________________________________
>> >>> Biojava-l mailing list - Biojava-l at lists.open-bio.org
>> >>> http://lists.open-bio.org/mailman/listinfo/biojava-l
> _______________________________________________
> biojava-dev mailing list
> biojava-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-dev