[Biojava-l] Biojava-l Digest, Vol 117, Issue 10

Khalil El Mazouari khalil.elmazouari at gmail.com
Tue Oct 23 16:48:32 UTC 2012


Hi Hilmar,

I have used regexes a lot in Perl and Java. I was also skeptical about regexes in Java when I started using them.

From my own experience, I can tell you the following:
It's MUCH easier to write regexes in Perl than in Java.
Java regexes require more optimisation: a working regex and an optimal regex are two different things.
In Java, Patterns must be compiled first. So, if you iterate over a large number of strings you want to match, compile your pattern outside the loop.
If you use regexes in a large iteration, avoid the java.lang.String methods that take a regex (String.replaceFirst, String.replaceAll, String.matches, ...): your pattern will be compiled on each call.
Avoid applying a regex to a large string. If possible, limit the matching to the places where the pattern can occur; methods like indexOf, lastIndexOf, split ... from java.lang.String are very useful in this regard.
It's easier to get the matching groups in Java than in Perl.
Test first with a tool like RegExhibit or your IDE's regex plugin.
Finally, I recommend the book Java Regular Expressions by Mehran Habibi (http://www.amazon.com/Java-Regular-Expressions-Taming-java-util-regex/dp/1590591070)
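
As a rough illustration of the points about compiling outside the loop and reading groups, here is a minimal sketch. The header format (">id len=N") and the names PrecompiledRegexSketch, HEADER and idsWithMinLength are invented for the example; they are not taken from BioJava.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PrecompiledRegexSketch {

    // Compiled once and reused for every line. The ">id len=N" header
    // format is a made-up example, not a real file format definition.
    private static final Pattern HEADER =
            Pattern.compile("^>(\\S+)\\s+len=(\\d{1,9})$");

    /** Returns the ids of header lines whose declared length is at least minLen. */
    public static List<String> idsWithMinLength(List<String> lines, int minLen) {
        List<String> ids = new ArrayList<String>();
        // One Matcher, reset per line, instead of line.matches(regex),
        // which would compile the pattern again on every call.
        Matcher m = HEADER.matcher("");
        for (String line : lines) {
            m.reset(line);
            if (m.matches()) {
                String id = m.group(1);                 // first capturing group
                int len = Integer.parseInt(m.group(2)); // second capturing group
                if (len >= minLen) {
                    ids.add(id);
                }
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(">seq1 len=120", "ACGT", ">seq2 len=40");
        System.out.println(idsWithMinLength(lines, 100));
    }
}
```

The Pattern object can be shared freely; the Matcher holds state and should stay local to the method or thread that uses it.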

If your regexes are well optimised, you will not notice any difference between Perl and Java.
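
One possible reading of "well optimised", following the advice on large strings: find the interesting spot with a cheap literal search first, then let the regex look only there. This is a sketch under assumptions; the "gene=<name>" attribute syntax and the names RegionLimitedMatchSketch, TAG, GENE and geneName are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegionLimitedMatchSketch {

    // Hypothetical attribute syntax: gene=<name>, where the name is word characters.
    private static final String TAG = "gene=";
    private static final Pattern GENE = Pattern.compile("gene=(\\w+)");

    /**
     * Returns the gene name found at the first "gene=" in the record,
     * or null if the tag is absent or not followed by a name.
     */
    public static String geneName(String record) {
        // Cheap literal search; no regex work at all if the tag is missing.
        int start = record.indexOf(TAG);
        if (start < 0) {
            return null;
        }
        Matcher m = GENE.matcher(record);
        // Restrict matching to the part of the string starting at the tag.
        m.region(start, record.length());
        // lookingAt() only tries to match at the start of the region.
        return m.lookingAt() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(geneName("chr1\t100\t200\tid=7;gene=BRCA2;note=x"));
    }
}
```

Only the first occurrence of the tag is examined here; whether that is enough depends on the data.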

If you need to use regexes in a complex algorithm or piece of software in combination with Java/BioJava, don't hesitate: Java regexes are excellent. If you just need a regex in a small script, go for Perl.

Best

khalil 



On 22 Oct 2012, at 18:00, biojava-l-request at lists.open-bio.org wrote:

> Today's Topics:
> 
>   1. regex performance in Java (Hilmar Lapp)
> 
> 
> ----------------------------------------------------------------------
> 
> Message: 1
> Date: Mon, 22 Oct 2012 10:52:24 -0400
> From: Hilmar Lapp <hlapp at drycafe.net>
> Subject: [Biojava-l] regex performance in Java
> To: BioJava <biojava-l at biojava.org>
> Message-ID: <1B62BC3E-B005-4484-AE66-0B8F407E4756 at drycafe.net>
> Content-Type: text/plain; charset=us-ascii
> 
> I know that this is really a Java language topic, but since parsing biological data formats is so rife with regular expression applications, I'm curious what the experience is among the Biojava people with the use of regular expressions in Java. 
> 
> They (at least as in java.util.regex) have been reported to me as performing much slower (by several orders of magnitude) than the regex implementation in Perl, and some simple benchmarking tests seem to bear that out. Even after scrutinizing the benchmark and finding nothing obvious, I'm still skeptical as to why this would be the case - naively I would have assumed that the underlying runtime library is implemented in C in both cases. But perhaps this is not true?
> 
> Any experience people have had here speed-wise (or tricks or things not to do for Java regexes) would be appreciated.
> 
> 	-hilmar
> -- 
> ===========================================================
> : Hilmar Lapp -:- Durham, NC -:- hlapp at drycafe dot net :
> ===========================================================
> 




