[Biojava-l] Evolutionary distances
Richard Holland
holland at ebi.ac.uk
Wed Oct 24 13:33:53 UTC 2007
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
This particular code could easily be parallelised - given N threads, you
can simply divide the input into N chunks and get each thread to process
1/Nth of the input. You then combine the output of each thread to do the
final calculation.
But, it'd be bad practice to always fork a predetermined N threads for a
given task. It'd be much better to somehow be able to ask 'how parallel
can I make this?' at runtime by checking system resources, or maybe get
the parallel-savvy user to set an optional BioJava-wide parallelisation
hint. N could then be determined and the task divided appropriately.
cheers,
Richard
Mark Schreiber wrote:
> Another important consideration after optimization is can the task be
> multithreaded? Almost all modern computers have at least 2 cores. So
> if the algorithm can be parallelized you will get some performance
> bonus on most machines.
>
> Modern JVM's will automagically try to use idle CPU's to execute new
> threads spawned by the programmer.
>
> - Mark
>
> On 10/24/07, Andy Yates <ayates at ebi.ac.uk> wrote:
>> Yes a very good point & one I was going to make before hand but forgot :)
>>
>> Also not to mention that micro-benchmarks/profiling in Java are
>> notorious for giving false results due to VM warmup & JIT compilation
>> optimisations. There is a framework hosted on Java.net somewhere which
>> can perform VM warmups and code iterations to produce more accurate
>> benchmarking results; but the name escapes me at the moment.
>>
>> However looking at this particular code I get the feeling that this is
>> about as fast as its going to get without someone doing bitwise XOR
>> operations or some C code ... that's not an open invitation for people
>> to start recoding this in C :). At the end of the day the key to
>> optimisation is to ask the question "is it fast enough already?". If it
>> is then there's no point :)
>>
>> Andy
>>
>> Mark Schreiber wrote:
>>> Hi -
>>>
>>> >From experience the best way to optimize java code is to run a
>>> profiler. The one in Netbeans is quite good.
>>>
>>> The reason is that the hotspot or JIT compilers might natively compile
>>> the part of the code that you think is slow and actually make it
>>> faster than something else which becomes the bottle neck. Using a good
>>> profiler you can detect how much time is spent in each method and pin
>>> point some candidate methods for optimization. You can also see if
>>> there is a burden due to creation of lots of objects.
>>>
>>> - Mark
>>>
>>> On 10/24/07, Andy Yates <ayates at ebi.ac.uk> wrote:
>>>> Our code is very similar but not identical. The original programmer
>>>> shortcutted a lot of else if conditions by considering if the two bases
>>>> were equal or not. It can then calculate the transitional changes &
>>>> assume the rest are transversional.
>>>>
>>>> In terms of speed of both pieces of code I can't see an obvious way to
>>>> speed it up. Probably in our code removing the 10 or so calls to
>>>> String.charAt() with a two calls & referencing those chars might help
>>>> but in all honesty I cannot say.
>>>>
>>>> Andy
>>>>
>>>> Richard Holland wrote:
> Thanks.
>
> Your code is similar to the code we have in
> org.biojavax.bio.phylo.MultipleHitCorrection. I haven't checked it to
> see if it is identical, but it probably is.
>
> You can call our code like this:
>
> // import statement for biojava phylo stuff
> import org.biojavax.bio.phylo.*;
>
> // ...rest of code goes here
>
> // call Kimura2P
> String seq1 = ...; // Get seq1 and seq2 from somewhere
> String seq2 = ...;
> double result = MultipleHitCorrection.Kimura2P(seq1, seq2);
>
> Note that our implementation expects sequence strings to be in upper
> case, so you'll need to make sure your data is upper case or has been
> converted to upper case before calling our method.
>
> cheers,
> Richard
>
> vineith kaul wrote:
>>>>>>> This is what I have .....Thanks a lot fr the help.
>>>>>>>
>>>>>>>
>>>>>>> //Method to calculate the Kimura 2 parameter distance
>>>>>>> public static double K2P(String sequence1,String sequence2){
>>>>>>> long p=0,q=0,numberOfAlignedSites=0; // P= transitional
>>>>>>> differences (A<->G & T<->C) ; Q= transversional differences (A/G<-->C/T)
>>>>>>>
>>>>>>>
>>>>>>> char[] seq1array=sequence1.toCharArray();
>>>>>>> char[] seq2array=sequence2.toCharArray();
>>>>>>>
>>>>>>> for(int i=0;i<seq1array.length;i++){
>>>>>>> // Number of aligned sites
>>>>>>> if(((seq1array[i]=='a') ||
>>>>>>> (seq1array[i]=='A')||(seq1array[i]=='g') ||
>>>>>>> (seq1array[i]=='G')||(seq1array[i]=='c') || (seq1array[i]=='C') ||
>>>>>>> (seq1array[i]=='t') || (seq1array[i]=='T')) && ((seq2array[i]=='a') ||
>>>>>>> (seq2array[i]=='A')||(seq2array[i]=='c') ||
>>>>>>> (seq2array[i]=='C')||(seq2array[i]=='t') ||
>>>>>>> (seq2array[i]=='T')||(seq2array[i]=='g') || (seq2array[i]=='G'))) {
>>>>>>>
>>>>>>> numberOfAlignedSites++;
>>>>>>> }
>>>>>>>
>>>>>>> if(((seq1array[i]=='a') || (seq1array[i]=='A')) &&
>>>>>>> ((seq2array[i]=='g') || (seq2array[i]=='G'))) {
>>>>>>> p++;
>>>>>>> }
>>>>>>> else
>>>>>>> if(((seq1array[i]=='g') || (seq1array[i]=='G')) &&
>>>>>>> ((seq2array[i]=='a') || (seq2array[i]=='A'))) {
>>>>>>> p++;
>>>>>>> }
>>>>>>> else
>>>>>>> if(((seq1array[i]=='t') || (seq1array[i]=='T')) &&
>>>>>>> ((seq2array[i]=='c') || (seq2array[i]=='C'))) {
>>>>>>> p++;
>>>>>>> }
>>>>>>> else
>>>>>>> if(((seq1array[i]=='c') || (seq1array[i]=='C')) &&
>>>>>>> ((seq2array[i]=='t') || (seq2array[i]=='T'))) {
>>>>>>> p++;
>>>>>>> }
>>>>>>> else
>>>>>>> if(((seq1array[i]=='a') || (seq1array[i]=='A')) &&
>>>>>>> ((seq2array[i]=='c') || (seq2array[i]=='C'))) {
>>>>>>> q++;
>>>>>>> }
>>>>>>> else
>>>>>>> if(((seq1array[i]=='a') || (seq1array[i]=='A')) &&
>>>>>>> ((seq2array[i]=='t') || (seq2array[i]=='T'))) {
>>>>>>> q++;
>>>>>>> }
>>>>>>> else
>>>>>>> if(((seq1array[i]=='g') || (seq1array[i]=='G')) &&
>>>>>>> ((seq2array[i]=='c') || (seq2array[i]=='C'))) {
>>>>>>> q++;
>>>>>>> }
>>>>>>> else
>>>>>>> if(((seq1array[i]=='g') || (seq1array[i]=='G')) &&
>>>>>>> ((seq2array[i]=='t') || (seq2array[i]=='T'))) {
>>>>>>> q++;
>>>>>>> }
>>>>>>> else
>>>>>>> if(((seq1array[i]=='t') || (seq1array[i]=='T')) &&
>>>>>>> ((seq2array[i]=='a') || (seq2array[i]=='A'))) {
>>>>>>> q++;
>>>>>>> }
>>>>>>> else
>>>>>>> if(((seq1array[i]=='t') || (seq1array[i]=='T')) &&
>>>>>>> ((seq2array[i]=='g') || (seq2array[i]=='G'))) {
>>>>>>> q++;
>>>>>>> }
>>>>>>> else
>>>>>>> if(((seq1array[i]=='c') || (seq1array[i]=='C')) &&
>>>>>>> ((seq2array[i]=='a') || (seq2array[i]=='A'))) {
>>>>>>> q++;
>>>>>>> }
>>>>>>> else
>>>>>>> if(((seq1array[i]=='c') || (seq1array[i]=='C')) &&
>>>>>>> ((seq2array[i]=='g') || (seq2array[i]=='G'))) {
>>>>>>> q++;
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> }
>>>>>>>
>>>>>>> double P = 1.0 - (2.0 * ((double)p)/numberOfAlignedSites) -
>>>>>>> (((double)q)/numberOfAlignedSites);
>>>>>>> double Q = 1.0 - (2.0 * ((double)q)/numberOfAlignedSites);
>>>>>>> System.out.print(numberOfAlignedSites+"\t"+p+"\t"+q+"\t");
>>>>>>> double dist = (-0.5 * Math.log(P)) - ( 0.25 * Math.log(Q));
>>>>>>> return dist;
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 10/22/07, *Richard Holland* <holland at ebi.ac.uk
>>>>>>> <mailto:holland at ebi.ac.uk>> wrote:
>>>>>>>
>>>>>>> You should take a look at the latest 1.5 release, in the
>>>>>>> org.biojavax.bio.phylo packages. This code is the beginnings of some
>>>>>>> phylogenetics code that will perform tasks as you describe. The future
>>>>>>> plan is to extend this code to cover a wider range of use cases.
>>>>>>> Kimura2P
>>>>>>> is already implemented here, in
>>>>>>> org.biojavax.bio.phylo.MultipleHitCorrection.
>>>>>>>
>>>>>>> If you can't find code that will do what you want, but have written some
>>>>>>> before, then please do feel free to contribute it. Even if it is
>>>>>>> slow, I'm
>>>>>>> sure someone out there will be able to help optimise it!
>>>>>>>
>>>>>>> cheers,
>>>>>>> Richard
>>>>>>>
>>>>>>> On Sun, October 21, 2007 5:30 pm, vineith kaul wrote:
>>>>>>> > Hi,
>>>>>>> >
>>>>>>> > Are there functions to calculate evolutionary pairwise distances like
>>>>>>> > Kimura2P,Finkelstein etc in Biojava
>>>>>>> > I did write smthng on my own but on large sequences it runs terribly
>>>>>>> > slow and I am not even sure if thats right.
>>>>>>> > --
>>>>>>> > Vineith Kaul
>>>>>>> > Masters Student Bioinformatics
>>>>>>> > The Parker H. Petit Institute for Bioengineering and Bioscience (IBB)
>>>>>>> > Georgia Tech, Atlanta
>>>>>>> > _______________________________________________
>>>>>>> > Biojava-l mailing list - Biojava-l at lists.open-bio.org
>>>>>>> <mailto:Biojava-l at lists.open-bio.org>
>>>>>>> > http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>>>> >
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Richard Holland
>>>>>>> BioMart ( http://www.biomart.org/)
>>>>>>> EMBL-EBI
>>>>>>> Hinxton, Cambridgeshire CB10 1SD, UK
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Vineith Kaul
>>>>>>> Masters Student Bioinformatics
>>>>>>> The Parker H. Petit Institute for Bioengineering and Bioscience (IBB)
>>>>>>> Georgia Tech, Atlanta
_______________________________________________
Biojava-l mailing list - Biojava-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>> _______________________________________________
>>>> Biojava-l mailing list - Biojava-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
iD8DBQFHH0nB4C5LeMEKA/QRAjW9AJwPcaHByaQhAQRPLJOZt4gQRF0TbgCeIa6P
IEyRleSs1+AziCvfhcES8wI=
=uLDm
-----END PGP SIGNATURE-----
More information about the Biojava-l
mailing list