[Biojava-l] Evolutionary distances

Wed Oct 24 13:33:53 UTC 2007

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

This particular code could easily be parallelised - given N threads, you
can simply divide the input into N chunks and get each thread to process
1/Nth of the input. You then combine the output of each thread to do the
final calculation.

But, it'd be bad practice to always fork a predetermined N threads for a
given task. It'd be much better to somehow be able to ask 'how parallel
can I make this?' at runtime by checking system resources, or maybe get
the parallel-savvy user to set an optional BioJava-wide parallelisation
hint. N could then be determined and the task divided appropriately.

cheers,
Richard

Mark Schreiber wrote:
> Another important consideration after optimization is can the task be
> multithreaded?  Almost all modern computers have at least 2 cores. So
> if the algorithm can be parallelized you will get some performance
> bonus on most machines.
> 
> Modern JVM's will automagically try to use idle CPU's to execute new
> threads spawned by the programmer.
> 
> - Mark
> 
> On 10/24/07, Andy Yates <ayates at ebi.ac.uk> wrote:
>> Yes a very good point & one I was going to make before hand but forgot :)
>>
>> Also not to mention that micro-benchmarks/profiling in Java are
>> notorious for giving false results due to VM warmup & JIT compilation
>> optimisations. There is a framework hosted on Java.net somewhere which
>> can perform VM warmups and code iterations to produce more accurate
>> benchmarking results; but the name escapes me at the moment.
>>
>> However looking at this particular code I get the feeling that this is
>> about as fast as its going to get without someone doing bitwise XOR
>> operations or some C code ... that's not an open invitation for people
>> to start recoding this in C :). At the end of the day the key to
>> optimisation is to ask the question "is it fast enough already?". If it
>> is then there's no point :)
>>
>> Andy
>>
>> Mark Schreiber wrote:
>>> Hi -
>>>
>>> >From experience the best way to optimize java code is to run a
>>> profiler. The one in Netbeans is quite good.
>>>
>>> The reason is that the hotspot or JIT compilers might natively compile
>>> the part of the code that you think is slow and actually make it
>>> faster than something else which becomes the bottle neck. Using a good
>>> profiler you can detect how much time is spent in each method and pin
>>> point some candidate methods for optimization. You can also see if
>>> there is a burden due to creation of lots of objects.
>>>
>>> - Mark
>>>
>>> On 10/24/07, Andy Yates <ayates at ebi.ac.uk> wrote:
>>>> Our code is very similar but not identical. The original programmer
>>>> shortcutted a lot of else if conditions by considering if the two bases
>>>> were equal or not. It can then calculate the transitional changes &
>>>> assume the rest are transversional.
>>>>
>>>> In terms of speed of both pieces of code I can't see an obvious way to
>>>> speed it up. Probably in our code removing the 10 or so calls to
>>>> String.charAt() with a two calls & referencing those chars might help
>>>> but in all honesty I cannot say.
>>>>
>>>> Andy
>>>>
>>>> Richard Holland wrote:
> Thanks.
> 
> Your code is similar to the code we have in
> org.biojavax.bio.phylo.MultipleHitCorrection. I haven't checked it to
> see if it is identical, but it probably is.
> 
> You can call our code like this:
> 
>  // import statement for biojava phylo stuff
>  import org.biojavax.bio.phylo.*;
> 
>  // ...rest of code goes here
> 
>  // call Kimura2P
>  String seq1 = ...; // Get seq1 and seq2 from somewhere
>  String seq2 = ...;
>  double result = MultipleHitCorrection.Kimura2P(seq1, seq2);
> 
> Note that our implementation expects sequence strings to be in upper
> case, so you'll need to make sure your data is upper case or has been
> converted to upper case before calling our method.
> 
> cheers,
> Richard
> 
> vineith kaul wrote:
>>>>>>> This is what I have .....Thanks a lot  fr the help.
>>>>>>>
>>>>>>>
>>>>>>> //Method to calculate the Kimura 2 parameter distance
>>>>>>> public static double K2P(String sequence1,String sequence2){
>>>>>>>         long p=0,q=0,numberOfAlignedSites=0; // P= transitional
>>>>>>> differences (A<->G & T<->C) ; Q= transversional differences (A/G<-->C/T)
>>>>>>>
>>>>>>>
>>>>>>>         char[] seq1array=sequence1.toCharArray();
>>>>>>>         char[] seq2array=sequence2.toCharArray();
>>>>>>>
>>>>>>>         for(int i=0;i<seq1array.length;i++){
>>>>>>>                                 // Number of aligned sites
>>>>>>>                 if(((seq1array[i]=='a') ||
>>>>>>> (seq1array[i]=='A')||(seq1array[i]=='g') ||
>>>>>>> (seq1array[i]=='G')||(seq1array[i]=='c') || (seq1array[i]=='C') ||
>>>>>>> (seq1array[i]=='t') || (seq1array[i]=='T')) && ((seq2array[i]=='a') ||
>>>>>>> (seq2array[i]=='A')||(seq2array[i]=='c') ||
>>>>>>> (seq2array[i]=='C')||(seq2array[i]=='t') ||
>>>>>>> (seq2array[i]=='T')||(seq2array[i]=='g') || (seq2array[i]=='G'))) {
>>>>>>>
>>>>>>>                         numberOfAlignedSites++;
>>>>>>>                 }
>>>>>>>
>>>>>>>                 if(((seq1array[i]=='a') || (seq1array[i]=='A')) &&
>>>>>>> ((seq2array[i]=='g') || (seq2array[i]=='G'))) {
>>>>>>>                         p++;
>>>>>>>                 }
>>>>>>>                 else
>>>>>>>                 if(((seq1array[i]=='g') || (seq1array[i]=='G')) &&
>>>>>>> ((seq2array[i]=='a') || (seq2array[i]=='A'))) {
>>>>>>>                         p++;
>>>>>>>                 }
>>>>>>>                 else
>>>>>>>                 if(((seq1array[i]=='t') || (seq1array[i]=='T')) &&
>>>>>>> ((seq2array[i]=='c') || (seq2array[i]=='C'))) {
>>>>>>>                         p++;
>>>>>>>                 }
>>>>>>>                 else
>>>>>>>                 if(((seq1array[i]=='c') || (seq1array[i]=='C')) &&
>>>>>>> ((seq2array[i]=='t') || (seq2array[i]=='T'))) {
>>>>>>>                         p++;
>>>>>>>                 }
>>>>>>>                 else
>>>>>>>                 if(((seq1array[i]=='a') || (seq1array[i]=='A')) &&
>>>>>>> ((seq2array[i]=='c') || (seq2array[i]=='C'))) {
>>>>>>>                                 q++;
>>>>>>>                         }
>>>>>>>                 else
>>>>>>>                 if(((seq1array[i]=='a') || (seq1array[i]=='A')) &&
>>>>>>> ((seq2array[i]=='t') || (seq2array[i]=='T'))) {
>>>>>>>                                 q++;
>>>>>>>                         }
>>>>>>>                 else
>>>>>>>                 if(((seq1array[i]=='g') || (seq1array[i]=='G')) &&
>>>>>>> ((seq2array[i]=='c') || (seq2array[i]=='C'))) {
>>>>>>>                                         q++;
>>>>>>>                                 }
>>>>>>>                 else
>>>>>>>                 if(((seq1array[i]=='g') || (seq1array[i]=='G')) &&
>>>>>>> ((seq2array[i]=='t') || (seq2array[i]=='T'))) {
>>>>>>>                                         q++;
>>>>>>>                                 }
>>>>>>>                 else
>>>>>>>                 if(((seq1array[i]=='t') || (seq1array[i]=='T')) &&
>>>>>>> ((seq2array[i]=='a') || (seq2array[i]=='A'))) {
>>>>>>>                                         q++;
>>>>>>>                                 }
>>>>>>>                 else
>>>>>>>                 if(((seq1array[i]=='t') || (seq1array[i]=='T')) &&
>>>>>>> ((seq2array[i]=='g') || (seq2array[i]=='G'))) {
>>>>>>>                                         q++;
>>>>>>>                                 }
>>>>>>>                 else
>>>>>>>                 if(((seq1array[i]=='c') || (seq1array[i]=='C')) &&
>>>>>>> ((seq2array[i]=='a') || (seq2array[i]=='A'))) {
>>>>>>>                                         q++;
>>>>>>>                                 }
>>>>>>>                 else
>>>>>>>                 if(((seq1array[i]=='c') || (seq1array[i]=='C')) &&
>>>>>>> ((seq2array[i]=='g') || (seq2array[i]=='G'))) {
>>>>>>>                                         q++;
>>>>>>>                                 }
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>         }
>>>>>>>
>>>>>>>          double P = 1.0 - (2.0 * ((double)p)/numberOfAlignedSites) -
>>>>>>> (((double)q)/numberOfAlignedSites);
>>>>>>>          double Q = 1.0 - (2.0 * ((double)q)/numberOfAlignedSites);
>>>>>>>          System.out.print(numberOfAlignedSites+"\t"+p+"\t"+q+"\t");
>>>>>>>          double dist = (-0.5 * Math.log(P)) - ( 0.25 * Math.log(Q));
>>>>>>>          return dist;
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 10/22/07, *Richard Holland* <holland at ebi.ac.uk
>>>>>>> <mailto:holland at ebi.ac.uk>> wrote:
>>>>>>>
>>>>>>>     You should take a look at the latest 1.5 release, in the
>>>>>>>     org.biojavax.bio.phylo packages. This code is the beginnings of some
>>>>>>>     phylogenetics code that will perform tasks as you describe. The future
>>>>>>>     plan is to extend this code to cover a wider range of use cases.
>>>>>>>     Kimura2P
>>>>>>>     is already implemented here, in
>>>>>>>     org.biojavax.bio.phylo.MultipleHitCorrection.
>>>>>>>
>>>>>>>     If you can't find code that will do what you want, but have written some
>>>>>>>     before, then please do feel free to contribute it. Even if it is
>>>>>>>     slow, I'm
>>>>>>>     sure someone out there will be able to help optimise it!
>>>>>>>
>>>>>>>     cheers,
>>>>>>>     Richard
>>>>>>>
>>>>>>>     On Sun, October 21, 2007 5:30 pm, vineith kaul wrote:
>>>>>>>     > Hi,
>>>>>>>     >
>>>>>>>     > Are there functions to calculate evolutionary pairwise distances like
>>>>>>>     > Kimura2P,Finkelstein etc in Biojava
>>>>>>>     > I did write smthng on my own but on large sequences it runs terribly
>>>>>>>     > slow and I am not even sure if thats right.
>>>>>>>     > --
>>>>>>>     > Vineith Kaul
>>>>>>>     > Masters Student Bioinformatics
>>>>>>>     > The Parker H. Petit Institute for Bioengineering and Bioscience (IBB)
>>>>>>>     > Georgia Tech, Atlanta
>>>>>>>     > _______________________________________________
>>>>>>>     > Biojava-l mailing list  -   Biojava-l at lists.open-bio.org
>>>>>>>     <mailto:Biojava-l at lists.open-bio.org>
>>>>>>>     > http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>>>>     >
>>>>>>>
>>>>>>>
>>>>>>>     --
>>>>>>>     Richard Holland
>>>>>>>     BioMart ( http://www.biomart.org/)
>>>>>>>     EMBL-EBI
>>>>>>>     Hinxton, Cambridgeshire CB10 1SD, UK
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Vineith Kaul
>>>>>>> Masters Student Bioinformatics
>>>>>>> The Parker H. Petit Institute for Bioengineering and Bioscience (IBB)
>>>>>>> Georgia Tech, Atlanta
_______________________________________________
Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>> _______________________________________________
>>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHH0nB4C5LeMEKA/QRAjW9AJwPcaHByaQhAQRPLJOZt4gQRF0TbgCeIa6P
IEyRleSs1+AziCvfhcES8wI=
=uLDm
-----END PGP SIGNATURE-----