[Biojava-l] Evolutionary distances

Richard Holland holland at ebi.ac.uk
Wed Oct 24 13:53:29 UTC 2007


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1



Mark Schreiber wrote:
> I'm not aware of a way to determine the number of CPU's within a
> program although possibly it is one the the environment variables
> available from System.

Yup, I'm not aware of one either. Actually, thinking about this, it'd be
a bad thing if BioJava grabbed both CPUs just because they're currently
available - the user might want it to only run on one, with something
else running on the second one. So attempting to guess a good
parallelisation value from the system is probably not good!

> Even if it can't be determined there could be a method argument to
> specify the number of threads to spawn.

I was thinking more along the lines of a global static method in some
kind of toolkit class, so that any part of BJ which is
parallelisation-aware can take advantage of it if it is set. This also
avoids passing parameters that don't have an immediately obvious impact
on the expected output of the method. I'd also like to have this global
variable control the total number of threads, so that if the user forks
a set of threads themselves and runs a parallel-aware method in each of
them, then BJ will not attempt to sub-divide each thread into more
threads than the limit configured by this variable. Likewise if the user
changes the limit whilst threads are currently running, they should stop
(if there are too many) or new ones should start (if there are too few),
but taking care to make sure that every parallelisation request
maintains at least one thread so the job doesn't stop entirely.... there
must be a toolkit for this somewhere surely?

cheers,
Richard

> - Mark
> 
> On 10/24/07, Richard Holland <holland at ebi.ac.uk> wrote:
> This particular code could easily be parallelised - given N threads, you
> can simply divide the input into N chunks and get each thread to process
> 1/Nth of the input. You then combine the output of each thread to do the
> final calculation.
> 
> But, it'd be bad practice to always fork a predetermined N threads for a
> given task. It'd be much better to somehow be able to ask 'how parallel
> can I make this?' at runtime by checking system resources, or maybe get
> the parallel-savvy user to set an optional BioJava-wide parallelisation
> hint. N could then be determined and the task divided appropriately.
> 
> cheers,
> Richard
> 
> Mark Schreiber wrote:
>>>> Another important consideration after optimization is can the task be
>>>> multithreaded?  Almost all modern computers have at least 2 cores. So
>>>> if the algorithm can be parallelized you will get some performance
>>>> bonus on most machines.
>>>>
>>>> Modern JVM's will automagically try to use idle CPU's to execute new
>>>> threads spawned by the programmer.
>>>>
>>>> - Mark
>>>>
>>>> On 10/24/07, Andy Yates <ayates at ebi.ac.uk> wrote:
>>>>> Yes a very good point & one I was going to make before hand but forgot :)
>>>>>
>>>>> Also not to mention that micro-benchmarks/profiling in Java are
>>>>> notorious for giving false results due to VM warmup & JIT compilation
>>>>> optimisations. There is a framework hosted on Java.net somewhere which
>>>>> can perform VM warmups and code iterations to produce more accurate
>>>>> benchmarking results; but the name escapes me at the moment.
>>>>>
>>>>> However looking at this particular code I get the feeling that this is
>>>>> about as fast as its going to get without someone doing bitwise XOR
>>>>> operations or some C code ... that's not an open invitation for people
>>>>> to start recoding this in C :). At the end of the day the key to
>>>>> optimisation is to ask the question "is it fast enough already?". If it
>>>>> is then there's no point :)
>>>>>
>>>>> Andy
>>>>>
>>>>> Mark Schreiber wrote:
>>>>>> Hi -
>>>>>>
>>>>>> >From experience the best way to optimize java code is to run a
>>>>>> profiler. The one in Netbeans is quite good.
>>>>>>
>>>>>> The reason is that the hotspot or JIT compilers might natively compile
>>>>>> the part of the code that you think is slow and actually make it
>>>>>> faster than something else which becomes the bottle neck. Using a good
>>>>>> profiler you can detect how much time is spent in each method and pin
>>>>>> point some candidate methods for optimization. You can also see if
>>>>>> there is a burden due to creation of lots of objects.
>>>>>>
>>>>>> - Mark
>>>>>>
>>>>>> On 10/24/07, Andy Yates <ayates at ebi.ac.uk> wrote:
>>>>>>> Our code is very similar but not identical. The original programmer
>>>>>>> shortcutted a lot of else if conditions by considering if the two bases
>>>>>>> were equal or not. It can then calculate the transitional changes &
>>>>>>> assume the rest are transversional.
>>>>>>>
>>>>>>> In terms of speed of both pieces of code I can't see an obvious way to
>>>>>>> speed it up. Probably in our code removing the 10 or so calls to
>>>>>>> String.charAt() with a two calls & referencing those chars might help
>>>>>>> but in all honesty I cannot say.
>>>>>>>
>>>>>>> Andy
>>>>>>>
>>>>>>> Richard Holland wrote:
>>>> Thanks.
>>>>
>>>> Your code is similar to the code we have in
>>>> org.biojavax.bio.phylo.MultipleHitCorrection. I haven't checked it to
>>>> see if it is identical, but it probably is.
>>>>
>>>> You can call our code like this:
>>>>
>>>>  // import statement for biojava phylo stuff
>>>>  import org.biojavax.bio.phylo.*;
>>>>
>>>>  // ...rest of code goes here
>>>>
>>>>  // call Kimura2P
>>>>  String seq1 = ...; // Get seq1 and seq2 from somewhere
>>>>  String seq2 = ...;
>>>>  double result = MultipleHitCorrection.Kimura2P(seq1, seq2);
>>>>
>>>> Note that our implementation expects sequence strings to be in upper
>>>> case, so you'll need to make sure your data is upper case or has been
>>>> converted to upper case before calling our method.
>>>>
>>>> cheers,
>>>> Richard
>>>>
>>>> vineith kaul wrote:
>>>>>>>>>> This is what I have .....Thanks a lot  fr the help.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> //Method to calculate the Kimura 2 parameter distance
>>>>>>>>>> public static double K2P(String sequence1,String sequence2){
>>>>>>>>>>         long p=0,q=0,numberOfAlignedSites=0; // P= transitional
>>>>>>>>>> differences (A<->G & T<->C) ; Q= transversional differences (A/G<-->C/T)
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>         char[] seq1array=sequence1.toCharArray();
>>>>>>>>>>         char[] seq2array=sequence2.toCharArray();
>>>>>>>>>>
>>>>>>>>>>         for(int i=0;i<seq1array.length;i++){
>>>>>>>>>>                                 // Number of aligned sites
>>>>>>>>>>                 if(((seq1array[i]=='a') ||
>>>>>>>>>> (seq1array[i]=='A')||(seq1array[i]=='g') ||
>>>>>>>>>> (seq1array[i]=='G')||(seq1array[i]=='c') || (seq1array[i]=='C') ||
>>>>>>>>>> (seq1array[i]=='t') || (seq1array[i]=='T')) && ((seq2array[i]=='a') ||
>>>>>>>>>> (seq2array[i]=='A')||(seq2array[i]=='c') ||
>>>>>>>>>> (seq2array[i]=='C')||(seq2array[i]=='t') ||
>>>>>>>>>> (seq2array[i]=='T')||(seq2array[i]=='g') || (seq2array[i]=='G'))) {
>>>>>>>>>>
>>>>>>>>>>                         numberOfAlignedSites++;
>>>>>>>>>>                 }
>>>>>>>>>>
>>>>>>>>>>                 if(((seq1array[i]=='a') || (seq1array[i]=='A')) &&
>>>>>>>>>> ((seq2array[i]=='g') || (seq2array[i]=='G'))) {
>>>>>>>>>>                         p++;
>>>>>>>>>>                 }
>>>>>>>>>>                 else
>>>>>>>>>>                 if(((seq1array[i]=='g') || (seq1array[i]=='G')) &&
>>>>>>>>>> ((seq2array[i]=='a') || (seq2array[i]=='A'))) {
>>>>>>>>>>                         p++;
>>>>>>>>>>                 }
>>>>>>>>>>                 else
>>>>>>>>>>                 if(((seq1array[i]=='t') || (seq1array[i]=='T')) &&
>>>>>>>>>> ((seq2array[i]=='c') || (seq2array[i]=='C'))) {
>>>>>>>>>>                         p++;
>>>>>>>>>>                 }
>>>>>>>>>>                 else
>>>>>>>>>>                 if(((seq1array[i]=='c') || (seq1array[i]=='C')) &&
>>>>>>>>>> ((seq2array[i]=='t') || (seq2array[i]=='T'))) {
>>>>>>>>>>                         p++;
>>>>>>>>>>                 }
>>>>>>>>>>                 else
>>>>>>>>>>                 if(((seq1array[i]=='a') || (seq1array[i]=='A')) &&
>>>>>>>>>> ((seq2array[i]=='c') || (seq2array[i]=='C'))) {
>>>>>>>>>>                                 q++;
>>>>>>>>>>                         }
>>>>>>>>>>                 else
>>>>>>>>>>                 if(((seq1array[i]=='a') || (seq1array[i]=='A')) &&
>>>>>>>>>> ((seq2array[i]=='t') || (seq2array[i]=='T'))) {
>>>>>>>>>>                                 q++;
>>>>>>>>>>                         }
>>>>>>>>>>                 else
>>>>>>>>>>                 if(((seq1array[i]=='g') || (seq1array[i]=='G')) &&
>>>>>>>>>> ((seq2array[i]=='c') || (seq2array[i]=='C'))) {
>>>>>>>>>>                                         q++;
>>>>>>>>>>                                 }
>>>>>>>>>>                 else
>>>>>>>>>>                 if(((seq1array[i]=='g') || (seq1array[i]=='G')) &&
>>>>>>>>>> ((seq2array[i]=='t') || (seq2array[i]=='T'))) {
>>>>>>>>>>                                         q++;
>>>>>>>>>>                                 }
>>>>>>>>>>                 else
>>>>>>>>>>                 if(((seq1array[i]=='t') || (seq1array[i]=='T')) &&
>>>>>>>>>> ((seq2array[i]=='a') || (seq2array[i]=='A'))) {
>>>>>>>>>>                                         q++;
>>>>>>>>>>                                 }
>>>>>>>>>>                 else
>>>>>>>>>>                 if(((seq1array[i]=='t') || (seq1array[i]=='T')) &&
>>>>>>>>>> ((seq2array[i]=='g') || (seq2array[i]=='G'))) {
>>>>>>>>>>                                         q++;
>>>>>>>>>>                                 }
>>>>>>>>>>                 else
>>>>>>>>>>                 if(((seq1array[i]=='c') || (seq1array[i]=='C')) &&
>>>>>>>>>> ((seq2array[i]=='a') || (seq2array[i]=='A'))) {
>>>>>>>>>>                                         q++;
>>>>>>>>>>                                 }
>>>>>>>>>>                 else
>>>>>>>>>>                 if(((seq1array[i]=='c') || (seq1array[i]=='C')) &&
>>>>>>>>>> ((seq2array[i]=='g') || (seq2array[i]=='G'))) {
>>>>>>>>>>                                         q++;
>>>>>>>>>>                                 }
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>         }
>>>>>>>>>>
>>>>>>>>>>          double P = 1.0 - (2.0 * ((double)p)/numberOfAlignedSites) -
>>>>>>>>>> (((double)q)/numberOfAlignedSites);
>>>>>>>>>>          double Q = 1.0 - (2.0 * ((double)q)/numberOfAlignedSites);
>>>>>>>>>>          System.out.print(numberOfAlignedSites+"\t"+p+"\t"+q+"\t");
>>>>>>>>>>          double dist = (-0.5 * Math.log(P)) - ( 0.25 * Math.log(Q));
>>>>>>>>>>          return dist;
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 10/22/07, *Richard Holland* <holland at ebi.ac.uk
>>>>>>>>>> <mailto:holland at ebi.ac.uk>> wrote:
>>>>>>>>>>
>>>>>>>>>>     You should take a look at the latest 1.5 release, in the
>>>>>>>>>>     org.biojavax.bio.phylo packages. This code is the beginnings of some
>>>>>>>>>>     phylogenetics code that will perform tasks as you describe. The future
>>>>>>>>>>     plan is to extend this code to cover a wider range of use cases.
>>>>>>>>>>     Kimura2P
>>>>>>>>>>     is already implemented here, in
>>>>>>>>>>     org.biojavax.bio.phylo.MultipleHitCorrection.
>>>>>>>>>>
>>>>>>>>>>     If you can't find code that will do what you want, but have written some
>>>>>>>>>>     before, then please do feel free to contribute it. Even if it is
>>>>>>>>>>     slow, I'm
>>>>>>>>>>     sure someone out there will be able to help optimise it!
>>>>>>>>>>
>>>>>>>>>>     cheers,
>>>>>>>>>>     Richard
>>>>>>>>>>
>>>>>>>>>>     On Sun, October 21, 2007 5:30 pm, vineith kaul wrote:
>>>>>>>>>>     > Hi,
>>>>>>>>>>     >
>>>>>>>>>>     > Are there functions to calculate evolutionary pairwise distances like
>>>>>>>>>>     > Kimura2P,Finkelstein etc in Biojava
>>>>>>>>>>     > I did write smthng on my own but on large sequences it runs terribly
>>>>>>>>>>     > slow and I am not even sure if thats right.
>>>>>>>>>>     > --
>>>>>>>>>>     > Vineith Kaul
>>>>>>>>>>     > Masters Student Bioinformatics
>>>>>>>>>>     > The Parker H. Petit Institute for Bioengineering and Bioscience (IBB)
>>>>>>>>>>     > Georgia Tech, Atlanta
>>>>>>>>>>     > _______________________________________________
>>>>>>>>>>     > Biojava-l mailing list  -   Biojava-l at lists.open-bio.org
>>>>>>>>>>     <mailto:Biojava-l at lists.open-bio.org>
>>>>>>>>>>     > http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>>>>>>>     >
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>     --
>>>>>>>>>>     Richard Holland
>>>>>>>>>>     BioMart ( http://www.biomart.org/)
>>>>>>>>>>     EMBL-EBI
>>>>>>>>>>     Hinxton, Cambridgeshire CB10 1SD, UK
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Vineith Kaul
>>>>>>>>>> Masters Student Bioinformatics
>>>>>>>>>> The Parker H. Petit Institute for Bioengineering and Bioscience (IBB)
>>>>>>>>>> Georgia Tech, Atlanta
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>>>> _______________________________________________
>>>>>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>>>>
>>

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHH05Y4C5LeMEKA/QRAouqAJ9TgDACIQLPeenSZcStDhkZQg/UuQCfc7sZ
cocyjnf9/T8H3uQJ+rW5m2U=
=Q6UR
-----END PGP SIGNATURE-----



More information about the Biojava-l mailing list