[Biojava-l] Evolutionary distances

Wed Oct 24 13:49:22 UTC 2007

Of course parallelisation all depends on the task not being limited by 
something else like memory, IO or database (which of course this 
wouldn't be). There's also the scenario where thread startup takes 
longer than running the code in serial :). Not to mention Java 
concurrency isn't an easy thing to write correctly.

I'd prefer the model promoted in Java5 where you have pools of threads & 
pass in instances of Callable (which are a successor to Runnable but 
return Futures which return objects & exceptions). You then pass in a 
list of these callables & wait for them all to finish & grab the 
results. You can have as many callables as you like & the thread pool 
will process them as & when a thread becomes free. Combine this with 
looking at the reported number of processors/cores on the machine & say 
that's the default size of the pool (assuming you're making it parallel 
because you're flat-lining a processor).

Say:

int processorCount = Runtime.getRuntime().availableProcessors();
ExecutorService.createThreadPool(processorCount);

This code might be wrong (well the creating the thread pool bit) but you 
get the idea :). Of course someone may not want to parallise a job (I 
quite like having dual cores as a runaway process can take out one but I 
can still run top & kill the thing).

Andy

Richard Holland wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> This particular code could easily be parallelised - given N threads, you
> can simply divide the input into N chunks and get each thread to process
> 1/Nth of the input. You then combine the output of each thread to do the
> final calculation.
> 
> But, it'd be bad practice to always fork a predetermined N threads for a
> given task. It'd be much better to somehow be able to ask 'how parallel
> can I make this?' at runtime by checking system resources, or maybe get
> the parallel-savvy user to set an optional BioJava-wide parallelisation
> hint. N could then be determined and the task divided appropriately.
> 
> cheers,
> Richard
> 
> Mark Schreiber wrote:
>> Another important consideration after optimization is can the task be
>> multithreaded?  Almost all modern computers have at least 2 cores. So
>> if the algorithm can be parallelized you will get some performance
>> bonus on most machines.
>>
>> Modern JVM's will automagically try to use idle CPU's to execute new
>> threads spawned by the programmer.
>>
>> - Mark
>>
>> On 10/24/07, Andy Yates <ayates at ebi.ac.uk> wrote:
>>> Yes a very good point & one I was going to make before hand but forgot :)
>>>
>>> Also not to mention that micro-benchmarks/profiling in Java are
>>> notorious for giving false results due to VM warmup & JIT compilation
>>> optimisations. There is a framework hosted on Java.net somewhere which
>>> can perform VM warmups and code iterations to produce more accurate
>>> benchmarking results; but the name escapes me at the moment.
>>>
>>> However looking at this particular code I get the feeling that this is
>>> about as fast as its going to get without someone doing bitwise XOR
>>> operations or some C code ... that's not an open invitation for people
>>> to start recoding this in C :). At the end of the day the key to
>>> optimisation is to ask the question "is it fast enough already?". If it
>>> is then there's no point :)
>>>
>>> Andy
>>>
>>> Mark Schreiber wrote:
>>>> Hi -
>>>>
>>>> >From experience the best way to optimize java code is to run a
>>>> profiler. The one in Netbeans is quite good.
>>>>
>>>> The reason is that the hotspot or JIT compilers might natively compile
>>>> the part of the code that you think is slow and actually make it
>>>> faster than something else which becomes the bottle neck. Using a good
>>>> profiler you can detect how much time is spent in each method and pin
>>>> point some candidate methods for optimization. You can also see if
>>>> there is a burden due to creation of lots of objects.
>>>>
>>>> - Mark
>>>>
>>>> On 10/24/07, Andy Yates <ayates at ebi.ac.uk> wrote:
>>>>> Our code is very similar but not identical. The original programmer
>>>>> shortcutted a lot of else if conditions by considering if the two bases
>>>>> were equal or not. It can then calculate the transitional changes &
>>>>> assume the rest are transversional.
>>>>>
>>>>> In terms of speed of both pieces of code I can't see an obvious way to
>>>>> speed it up. Probably in our code removing the 10 or so calls to
>>>>> String.charAt() with a two calls & referencing those chars might help
>>>>> but in all honesty I cannot say.
>>>>>
>>>>> Andy
>>>>>
>>>>> Richard Holland wrote:
>> Thanks.
>>
>> Your code is similar to the code we have in
>> org.biojavax.bio.phylo.MultipleHitCorrection. I haven't checked it to
>> see if it is identical, but it probably is.
>>
>> You can call our code like this:
>>
>>  // import statement for biojava phylo stuff
>>  import org.biojavax.bio.phylo.*;
>>
>>  // ...rest of code goes here
>>
>>  // call Kimura2P
>>  String seq1 = ...; // Get seq1 and seq2 from somewhere
>>  String seq2 = ...;
>>  double result = MultipleHitCorrection.Kimura2P(seq1, seq2);
>>
>> Note that our implementation expects sequence strings to be in upper
>> case, so you'll need to make sure your data is upper case or has been
>> converted to upper case before calling our method.
>>
>> cheers,
>> Richard
>>
>> vineith kaul wrote:
>>>>>>>> This is what I have .....Thanks a lot  fr the help.
>>>>>>>>
>>>>>>>>
>>>>>>>> //Method to calculate the Kimura 2 parameter distance
>>>>>>>> public static double K2P(String sequence1,String sequence2){
>>>>>>>>         long p=0,q=0,numberOfAlignedSites=0; // P= transitional
>>>>>>>> differences (A<->G & T<->C) ; Q= transversional differences (A/G<-->C/T)
>>>>>>>>
>>>>>>>>
>>>>>>>>         char[] seq1array=sequence1.toCharArray();
>>>>>>>>         char[] seq2array=sequence2.toCharArray();
>>>>>>>>
>>>>>>>>         for(int i=0;i<seq1array.length;i++){
>>>>>>>>                                 // Number of aligned sites
>>>>>>>>                 if(((seq1array[i]=='a') ||
>>>>>>>> (seq1array[i]=='A')||(seq1array[i]=='g') ||
>>>>>>>> (seq1array[i]=='G')||(seq1array[i]=='c') || (seq1array[i]=='C') ||
>>>>>>>> (seq1array[i]=='t') || (seq1array[i]=='T')) && ((seq2array[i]=='a') ||
>>>>>>>> (seq2array[i]=='A')||(seq2array[i]=='c') ||
>>>>>>>> (seq2array[i]=='C')||(seq2array[i]=='t') ||
>>>>>>>> (seq2array[i]=='T')||(seq2array[i]=='g') || (seq2array[i]=='G'))) {
>>>>>>>>
>>>>>>>>                         numberOfAlignedSites++;
>>>>>>>>                 }
>>>>>>>>
>>>>>>>>                 if(((seq1array[i]=='a') || (seq1array[i]=='A')) &&
>>>>>>>> ((seq2array[i]=='g') || (seq2array[i]=='G'))) {
>>>>>>>>                         p++;
>>>>>>>>                 }
>>>>>>>>                 else
>>>>>>>>                 if(((seq1array[i]=='g') || (seq1array[i]=='G')) &&
>>>>>>>> ((seq2array[i]=='a') || (seq2array[i]=='A'))) {
>>>>>>>>                         p++;
>>>>>>>>                 }
>>>>>>>>                 else
>>>>>>>>                 if(((seq1array[i]=='t') || (seq1array[i]=='T')) &&
>>>>>>>> ((seq2array[i]=='c') || (seq2array[i]=='C'))) {
>>>>>>>>                         p++;
>>>>>>>>                 }
>>>>>>>>                 else
>>>>>>>>                 if(((seq1array[i]=='c') || (seq1array[i]=='C')) &&
>>>>>>>> ((seq2array[i]=='t') || (seq2array[i]=='T'))) {
>>>>>>>>                         p++;
>>>>>>>>                 }
>>>>>>>>                 else
>>>>>>>>                 if(((seq1array[i]=='a') || (seq1array[i]=='A')) &&
>>>>>>>> ((seq2array[i]=='c') || (seq2array[i]=='C'))) {
>>>>>>>>                                 q++;
>>>>>>>>                         }
>>>>>>>>                 else
>>>>>>>>                 if(((seq1array[i]=='a') || (seq1array[i]=='A')) &&
>>>>>>>> ((seq2array[i]=='t') || (seq2array[i]=='T'))) {
>>>>>>>>                                 q++;
>>>>>>>>                         }
>>>>>>>>                 else
>>>>>>>>                 if(((seq1array[i]=='g') || (seq1array[i]=='G')) &&
>>>>>>>> ((seq2array[i]=='c') || (seq2array[i]=='C'))) {
>>>>>>>>                                         q++;
>>>>>>>>                                 }
>>>>>>>>                 else
>>>>>>>>                 if(((seq1array[i]=='g') || (seq1array[i]=='G')) &&
>>>>>>>> ((seq2array[i]=='t') || (seq2array[i]=='T'))) {
>>>>>>>>                                         q++;
>>>>>>>>                                 }
>>>>>>>>                 else
>>>>>>>>                 if(((seq1array[i]=='t') || (seq1array[i]=='T')) &&
>>>>>>>> ((seq2array[i]=='a') || (seq2array[i]=='A'))) {
>>>>>>>>                                         q++;
>>>>>>>>                                 }
>>>>>>>>                 else
>>>>>>>>                 if(((seq1array[i]=='t') || (seq1array[i]=='T')) &&
>>>>>>>> ((seq2array[i]=='g') || (seq2array[i]=='G'))) {
>>>>>>>>                                         q++;
>>>>>>>>                                 }
>>>>>>>>                 else
>>>>>>>>                 if(((seq1array[i]=='c') || (seq1array[i]=='C')) &&
>>>>>>>> ((seq2array[i]=='a') || (seq2array[i]=='A'))) {
>>>>>>>>                                         q++;
>>>>>>>>                                 }
>>>>>>>>                 else
>>>>>>>>                 if(((seq1array[i]=='c') || (seq1array[i]=='C')) &&
>>>>>>>> ((seq2array[i]=='g') || (seq2array[i]=='G'))) {
>>>>>>>>                                         q++;
>>>>>>>>                                 }
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>         }
>>>>>>>>
>>>>>>>>          double P = 1.0 - (2.0 * ((double)p)/numberOfAlignedSites) -
>>>>>>>> (((double)q)/numberOfAlignedSites);
>>>>>>>>          double Q = 1.0 - (2.0 * ((double)q)/numberOfAlignedSites);
>>>>>>>>          System.out.print(numberOfAlignedSites+"\t"+p+"\t"+q+"\t");
>>>>>>>>          double dist = (-0.5 * Math.log(P)) - ( 0.25 * Math.log(Q));
>>>>>>>>          return dist;
>>>>>>>> }
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 10/22/07, *Richard Holland* <holland at ebi.ac.uk
>>>>>>>> <mailto:holland at ebi.ac.uk>> wrote:
>>>>>>>>
>>>>>>>>     You should take a look at the latest 1.5 release, in the
>>>>>>>>     org.biojavax.bio.phylo packages. This code is the beginnings of some
>>>>>>>>     phylogenetics code that will perform tasks as you describe. The future
>>>>>>>>     plan is to extend this code to cover a wider range of use cases.
>>>>>>>>     Kimura2P
>>>>>>>>     is already implemented here, in
>>>>>>>>     org.biojavax.bio.phylo.MultipleHitCorrection.
>>>>>>>>
>>>>>>>>     If you can't find code that will do what you want, but have written some
>>>>>>>>     before, then please do feel free to contribute it. Even if it is
>>>>>>>>     slow, I'm
>>>>>>>>     sure someone out there will be able to help optimise it!
>>>>>>>>
>>>>>>>>     cheers,
>>>>>>>>     Richard
>>>>>>>>
>>>>>>>>     On Sun, October 21, 2007 5:30 pm, vineith kaul wrote:
>>>>>>>>     > Hi,
>>>>>>>>     >
>>>>>>>>     > Are there functions to calculate evolutionary pairwise distances like
>>>>>>>>     > Kimura2P,Finkelstein etc in Biojava
>>>>>>>>     > I did write smthng on my own but on large sequences it runs terribly
>>>>>>>>     > slow and I am not even sure if thats right.
>>>>>>>>     > --
>>>>>>>>     > Vineith Kaul
>>>>>>>>     > Masters Student Bioinformatics
>>>>>>>>     > The Parker H. Petit Institute for Bioengineering and Bioscience (IBB)
>>>>>>>>     > Georgia Tech, Atlanta
>>>>>>>>     > _______________________________________________
>>>>>>>>     > Biojava-l mailing list  -   Biojava-l at lists.open-bio.org
>>>>>>>>     <mailto:Biojava-l at lists.open-bio.org>
>>>>>>>>     > http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>>>>>     >
>>>>>>>>
>>>>>>>>
>>>>>>>>     --
>>>>>>>>     Richard Holland
>>>>>>>>     BioMart ( http://www.biomart.org/)
>>>>>>>>     EMBL-EBI
>>>>>>>>     Hinxton, Cambridgeshire CB10 1SD, UK
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Vineith Kaul
>>>>>>>> Masters Student Bioinformatics
>>>>>>>> The Parker H. Petit Institute for Bioengineering and Bioscience (IBB)
>>>>>>>> Georgia Tech, Atlanta
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>> _______________________________________________
>>>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>>
> 
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.2.2 (GNU/Linux)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
> 
> iD8DBQFHH0nB4C5LeMEKA/QRAjW9AJwPcaHByaQhAQRPLJOZt4gQRF0TbgCeIa6P
> IEyRleSs1+AziCvfhcES8wI=
> =uLDm
> -----END PGP SIGNATURE-----