[Biojava-l] Evolutionary distances

Andy Yates ayates at ebi.ac.uk
Wed Oct 24 13:49:38 UTC 2007


Beat me to it :)

Andy

Mark Schreiber wrote:
> It appears it is as simple as:
> 
> Runtime.getRuntime().availableProcessors();
> 
> - Mark
> 
> On 10/24/07, Mark Schreiber <markjschreiber at gmail.com> wrote:
>> I'm not aware of a way to determine the number of CPU's within a
>> program although possibly it is one the the environment variables
>> available from System.
>>
>> Even if it can't be determined there could be a method argument to
>> specify the number of threads to spawn.
>>
>> - Mark
>>
>> On 10/24/07, Richard Holland <holland at ebi.ac.uk> wrote:
>>> -----BEGIN PGP SIGNED MESSAGE-----
>>> Hash: SHA1
>>>
>>> This particular code could easily be parallelised - given N threads, you
>>> can simply divide the input into N chunks and get each thread to process
>>> 1/Nth of the input. You then combine the output of each thread to do the
>>> final calculation.
>>>
>>> But, it'd be bad practice to always fork a predetermined N threads for a
>>> given task. It'd be much better to somehow be able to ask 'how parallel
>>> can I make this?' at runtime by checking system resources, or maybe get
>>> the parallel-savvy user to set an optional BioJava-wide parallelisation
>>> hint. N could then be determined and the task divided appropriately.
>>>
>>> cheers,
>>> Richard
>>>
>>> Mark Schreiber wrote:
>>>> Another important consideration after optimization is can the task be
>>>> multithreaded?  Almost all modern computers have at least 2 cores. So
>>>> if the algorithm can be parallelized you will get some performance
>>>> bonus on most machines.
>>>>
>>>> Modern JVM's will automagically try to use idle CPU's to execute new
>>>> threads spawned by the programmer.
>>>>
>>>> - Mark
>>>>
>>>> On 10/24/07, Andy Yates <ayates at ebi.ac.uk> wrote:
>>>>> Yes a very good point & one I was going to make before hand but forgot :)
>>>>>
>>>>> Also not to mention that micro-benchmarks/profiling in Java are
>>>>> notorious for giving false results due to VM warmup & JIT compilation
>>>>> optimisations. There is a framework hosted on Java.net somewhere which
>>>>> can perform VM warmups and code iterations to produce more accurate
>>>>> benchmarking results; but the name escapes me at the moment.
>>>>>
>>>>> However looking at this particular code I get the feeling that this is
>>>>> about as fast as its going to get without someone doing bitwise XOR
>>>>> operations or some C code ... that's not an open invitation for people
>>>>> to start recoding this in C :). At the end of the day the key to
>>>>> optimisation is to ask the question "is it fast enough already?". If it
>>>>> is then there's no point :)
>>>>>
>>>>> Andy
>>>>>
>>>>> Mark Schreiber wrote:
>>>>>> Hi -
>>>>>>
>>>>>> >From experience the best way to optimize java code is to run a
>>>>>> profiler. The one in Netbeans is quite good.
>>>>>>
>>>>>> The reason is that the hotspot or JIT compilers might natively compile
>>>>>> the part of the code that you think is slow and actually make it
>>>>>> faster than something else which becomes the bottle neck. Using a good
>>>>>> profiler you can detect how much time is spent in each method and pin
>>>>>> point some candidate methods for optimization. You can also see if
>>>>>> there is a burden due to creation of lots of objects.
>>>>>>
>>>>>> - Mark
>>>>>>
>>>>>> On 10/24/07, Andy Yates <ayates at ebi.ac.uk> wrote:
>>>>>>> Our code is very similar but not identical. The original programmer
>>>>>>> shortcutted a lot of else if conditions by considering if the two bases
>>>>>>> were equal or not. It can then calculate the transitional changes &
>>>>>>> assume the rest are transversional.
>>>>>>>
>>>>>>> In terms of speed of both pieces of code I can't see an obvious way to
>>>>>>> speed it up. Probably in our code removing the 10 or so calls to
>>>>>>> String.charAt() with a two calls & referencing those chars might help
>>>>>>> but in all honesty I cannot say.
>>>>>>>
>>>>>>> Andy
>>>>>>>
>>>>>>> Richard Holland wrote:
>>>> Thanks.
>>>>
>>>> Your code is similar to the code we have in
>>>> org.biojavax.bio.phylo.MultipleHitCorrection. I haven't checked it to
>>>> see if it is identical, but it probably is.
>>>>
>>>> You can call our code like this:
>>>>
>>>>  // import statement for biojava phylo stuff
>>>>  import org.biojavax.bio.phylo.*;
>>>>
>>>>  // ...rest of code goes here
>>>>
>>>>  // call Kimura2P
>>>>  String seq1 = ...; // Get seq1 and seq2 from somewhere
>>>>  String seq2 = ...;
>>>>  double result = MultipleHitCorrection.Kimura2P(seq1, seq2);
>>>>
>>>> Note that our implementation expects sequence strings to be in upper
>>>> case, so you'll need to make sure your data is upper case or has been
>>>> converted to upper case before calling our method.
>>>>
>>>> cheers,
>>>> Richard
>>>>
>>>> vineith kaul wrote:
>>>>>>>>>> This is what I have .....Thanks a lot  fr the help.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> //Method to calculate the Kimura 2 parameter distance
>>>>>>>>>> public static double K2P(String sequence1,String sequence2){
>>>>>>>>>>         long p=0,q=0,numberOfAlignedSites=0; // P= transitional
>>>>>>>>>> differences (A<->G & T<->C) ; Q= transversional differences (A/G<-->C/T)
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>         char[] seq1array=sequence1.toCharArray();
>>>>>>>>>>         char[] seq2array=sequence2.toCharArray();
>>>>>>>>>>
>>>>>>>>>>         for(int i=0;i<seq1array.length;i++){
>>>>>>>>>>                                 // Number of aligned sites
>>>>>>>>>>                 if(((seq1array[i]=='a') ||
>>>>>>>>>> (seq1array[i]=='A')||(seq1array[i]=='g') ||
>>>>>>>>>> (seq1array[i]=='G')||(seq1array[i]=='c') || (seq1array[i]=='C') ||
>>>>>>>>>> (seq1array[i]=='t') || (seq1array[i]=='T')) && ((seq2array[i]=='a') ||
>>>>>>>>>> (seq2array[i]=='A')||(seq2array[i]=='c') ||
>>>>>>>>>> (seq2array[i]=='C')||(seq2array[i]=='t') ||
>>>>>>>>>> (seq2array[i]=='T')||(seq2array[i]=='g') || (seq2array[i]=='G'))) {
>>>>>>>>>>
>>>>>>>>>>                         numberOfAlignedSites++;
>>>>>>>>>>                 }
>>>>>>>>>>
>>>>>>>>>>                 if(((seq1array[i]=='a') || (seq1array[i]=='A')) &&
>>>>>>>>>> ((seq2array[i]=='g') || (seq2array[i]=='G'))) {
>>>>>>>>>>                         p++;
>>>>>>>>>>                 }
>>>>>>>>>>                 else
>>>>>>>>>>                 if(((seq1array[i]=='g') || (seq1array[i]=='G')) &&
>>>>>>>>>> ((seq2array[i]=='a') || (seq2array[i]=='A'))) {
>>>>>>>>>>                         p++;
>>>>>>>>>>                 }
>>>>>>>>>>                 else
>>>>>>>>>>                 if(((seq1array[i]=='t') || (seq1array[i]=='T')) &&
>>>>>>>>>> ((seq2array[i]=='c') || (seq2array[i]=='C'))) {
>>>>>>>>>>                         p++;
>>>>>>>>>>                 }
>>>>>>>>>>                 else
>>>>>>>>>>                 if(((seq1array[i]=='c') || (seq1array[i]=='C')) &&
>>>>>>>>>> ((seq2array[i]=='t') || (seq2array[i]=='T'))) {
>>>>>>>>>>                         p++;
>>>>>>>>>>                 }
>>>>>>>>>>                 else
>>>>>>>>>>                 if(((seq1array[i]=='a') || (seq1array[i]=='A')) &&
>>>>>>>>>> ((seq2array[i]=='c') || (seq2array[i]=='C'))) {
>>>>>>>>>>                                 q++;
>>>>>>>>>>                         }
>>>>>>>>>>                 else
>>>>>>>>>>                 if(((seq1array[i]=='a') || (seq1array[i]=='A')) &&
>>>>>>>>>> ((seq2array[i]=='t') || (seq2array[i]=='T'))) {
>>>>>>>>>>                                 q++;
>>>>>>>>>>                         }
>>>>>>>>>>                 else
>>>>>>>>>>                 if(((seq1array[i]=='g') || (seq1array[i]=='G')) &&
>>>>>>>>>> ((seq2array[i]=='c') || (seq2array[i]=='C'))) {
>>>>>>>>>>                                         q++;
>>>>>>>>>>                                 }
>>>>>>>>>>                 else
>>>>>>>>>>                 if(((seq1array[i]=='g') || (seq1array[i]=='G')) &&
>>>>>>>>>> ((seq2array[i]=='t') || (seq2array[i]=='T'))) {
>>>>>>>>>>                                         q++;
>>>>>>>>>>                                 }
>>>>>>>>>>                 else
>>>>>>>>>>                 if(((seq1array[i]=='t') || (seq1array[i]=='T')) &&
>>>>>>>>>> ((seq2array[i]=='a') || (seq2array[i]=='A'))) {
>>>>>>>>>>                                         q++;
>>>>>>>>>>                                 }
>>>>>>>>>>                 else
>>>>>>>>>>                 if(((seq1array[i]=='t') || (seq1array[i]=='T')) &&
>>>>>>>>>> ((seq2array[i]=='g') || (seq2array[i]=='G'))) {
>>>>>>>>>>                                         q++;
>>>>>>>>>>                                 }
>>>>>>>>>>                 else
>>>>>>>>>>                 if(((seq1array[i]=='c') || (seq1array[i]=='C')) &&
>>>>>>>>>> ((seq2array[i]=='a') || (seq2array[i]=='A'))) {
>>>>>>>>>>                                         q++;
>>>>>>>>>>                                 }
>>>>>>>>>>                 else
>>>>>>>>>>                 if(((seq1array[i]=='c') || (seq1array[i]=='C')) &&
>>>>>>>>>> ((seq2array[i]=='g') || (seq2array[i]=='G'))) {
>>>>>>>>>>                                         q++;
>>>>>>>>>>                                 }
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>         }
>>>>>>>>>>
>>>>>>>>>>          double P = 1.0 - (2.0 * ((double)p)/numberOfAlignedSites) -
>>>>>>>>>> (((double)q)/numberOfAlignedSites);
>>>>>>>>>>          double Q = 1.0 - (2.0 * ((double)q)/numberOfAlignedSites);
>>>>>>>>>>          System.out.print(numberOfAlignedSites+"\t"+p+"\t"+q+"\t");
>>>>>>>>>>          double dist = (-0.5 * Math.log(P)) - ( 0.25 * Math.log(Q));
>>>>>>>>>>          return dist;
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 10/22/07, *Richard Holland* <holland at ebi.ac.uk
>>>>>>>>>> <mailto:holland at ebi.ac.uk>> wrote:
>>>>>>>>>>
>>>>>>>>>>     You should take a look at the latest 1.5 release, in the
>>>>>>>>>>     org.biojavax.bio.phylo packages. This code is the beginnings of some
>>>>>>>>>>     phylogenetics code that will perform tasks as you describe. The future
>>>>>>>>>>     plan is to extend this code to cover a wider range of use cases.
>>>>>>>>>>     Kimura2P
>>>>>>>>>>     is already implemented here, in
>>>>>>>>>>     org.biojavax.bio.phylo.MultipleHitCorrection.
>>>>>>>>>>
>>>>>>>>>>     If you can't find code that will do what you want, but have written some
>>>>>>>>>>     before, then please do feel free to contribute it. Even if it is
>>>>>>>>>>     slow, I'm
>>>>>>>>>>     sure someone out there will be able to help optimise it!
>>>>>>>>>>
>>>>>>>>>>     cheers,
>>>>>>>>>>     Richard
>>>>>>>>>>
>>>>>>>>>>     On Sun, October 21, 2007 5:30 pm, vineith kaul wrote:
>>>>>>>>>>     > Hi,
>>>>>>>>>>     >
>>>>>>>>>>     > Are there functions to calculate evolutionary pairwise distances like
>>>>>>>>>>     > Kimura2P,Finkelstein etc in Biojava
>>>>>>>>>>     > I did write smthng on my own but on large sequences it runs terribly
>>>>>>>>>>     > slow and I am not even sure if thats right.
>>>>>>>>>>     > --
>>>>>>>>>>     > Vineith Kaul
>>>>>>>>>>     > Masters Student Bioinformatics
>>>>>>>>>>     > The Parker H. Petit Institute for Bioengineering and Bioscience (IBB)
>>>>>>>>>>     > Georgia Tech, Atlanta
>>>>>>>>>>     > _______________________________________________
>>>>>>>>>>     > Biojava-l mailing list  -   Biojava-l at lists.open-bio.org
>>>>>>>>>>     <mailto:Biojava-l at lists.open-bio.org>
>>>>>>>>>>     > http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>>>>>>>     >
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>     --
>>>>>>>>>>     Richard Holland
>>>>>>>>>>     BioMart ( http://www.biomart.org/)
>>>>>>>>>>     EMBL-EBI
>>>>>>>>>>     Hinxton, Cambridgeshire CB10 1SD, UK
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Vineith Kaul
>>>>>>>>>> Masters Student Bioinformatics
>>>>>>>>>> The Parker H. Petit Institute for Bioengineering and Bioscience (IBB)
>>>>>>>>>> Georgia Tech, Atlanta
>>> _______________________________________________
>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>>>> _______________________________________________
>>>>>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>>>>
>>> -----BEGIN PGP SIGNATURE-----
>>> Version: GnuPG v1.4.2.2 (GNU/Linux)
>>> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
>>>
>>> iD8DBQFHH0nB4C5LeMEKA/QRAjW9AJwPcaHByaQhAQRPLJOZt4gQRF0TbgCeIa6P
>>> IEyRleSs1+AziCvfhcES8wI=
>>> =uLDm
>>> -----END PGP SIGNATURE-----
>>>



More information about the Biojava-l mailing list