[Biojava-dev] Pairwise similarity score speed

Thu Jul 1 23:02:48 UTC 2010

One additional note: the number of similarities is cached, so after being 
computed on the first call to SequencePair.getNumSimilars() or 
FractionalSimilarityScorer.getScore(), further calls simply read a variable.

This means the first iteration over a list of sequence pairs may take a while 
for alignment and calculation of similarities, but later iterations will be fast 
so no additional array of scores is needed for storage.

Mark

On 7/1/2010 8:08 AM, Andy Yates wrote:
> Looking at the implementation the pairwise scores for the FractionalSimilarityScorer comes from SequencePair.getNumSimilars() so the pairwise scores should always be easily available
>
> Andy
>
> On 1 Jul 2010, at 13:02, Radhouane Aniba wrote:
>
>> Thx Andy,
>>
>> Yes absolutely true what you said. But my question was much concerning when Mark is calculating pairwise scores I don't know if there is an "elegant way" to keep temporary the pairwise scores for further treatments, i'm specifically thinking about an array of scores.
>>
>> TBC ..
>>
>> Rad
>>
>> 2010/7/1 Scooter Willis<HWillis at scripps.edu>
>> Andy
>>
>> I agree we should probably include the commons statistics jar as a must have for anything statistical or p-score related.
>>
>> Scooter
>>
>>
>>
>> On 7/1/10 7:28 AM, "Andy Yates"<ayates at ebi.ac.uk>  wrote:
>>
>> I believe that you could use Jakarta commons-math&  more specifically org.apache.commons.math.stat.descriptive.DescriptiveStatistics which will give you min, max, std deviation&  anything else you'd expect to be able to use to describe a range of values
>>
>> Andy
>>
>> On 1 Jul 2010, at 12:05, Radhouane Aniba wrote:
>>
>>> Mark !
>>>
>>> Can we modify the code you gave to :
>>>
>>> - extract the MIN pairwise similarity score
>>> - extract the MAX pairwise similarity score
>>> - calculate the standard deviation of similarity scores
>>>
>>> ?
>>>
>>> Rad
>>>
>>> 2010/6/30 Mark Chapman<chapman at cs.wisc.edu>
>>>
>>>> Hi Radwen,
>>>>
>>>> I have already added this functionality to the BioJava3 alignment package.
>>>> The code is available on the repository [1] and current builds are on the
>>>> web site [2].  The necessary files are [3] and [4] and in the example code
>>>> that follows you should only have to replace "piwi-seed-fasta.txt" with your
>>>> file name. Also, to switch from Needleman-Wunsch to Smith-Waterman, just
>>>> change PairwiseAligner.GLOBAL to PairwiseAligner.LOCAL .
>>>>
>>>>
>>>> int similars = 0, total = 0;
>>>> GapPenalty gaps = new SimpleGapPenalty();
>>>> SubstitutionMatrix<AminoAcidCompound>  blosum62 =
>>>>    new SimpleSubstitutionMatrix<AminoAcidCompound>();
>>>>
>>>> List<ProteinSequence>  piwi = new ArrayList<ProteinSequence>();
>>>> try {
>>>> piwi.addAll(FastaReaderHelper.readFastaProteinSequence(
>>>>      new File("piwi-seed-fasta.txt")).values());
>>>> } catch (Exception e) {
>>>> e.printStackTrace();
>>>> }
>>>>
>>>> for (SequencePair<ProteinSequence, AminoAcidCompound>  pair :
>>>>    Alignments.getAllPairsAlignments(piwi, PairwiseAligner.GLOBAL, gaps,
>>>>        blosum62)) {
>>>> PairwiseSequenceScorer<ProteinSequence, AminoAcidCompound>  scorer =
>>>>      new FractionalSimilarityScorer<ProteinSequence,
>>>> AminoAcidCompound>(pair);
>>>> System.out.printf("%n%s vs %s : %d / %d%n%s",
>>>> pair.getQuery().getAccession(),
>>>>      pair.getTarget().getAccession(), scorer.getScore(),
>>>> scorer.getMaxScore(),
>>>>      pair);
>>>> similars += scorer.getScore();
>>>> total += scorer.getMaxScore();
>>>> }
>>>>
>>>> System.out.printf("%nAverage similarity = %d / %d = %f", similars, total,
>>>>    (double)similars/total);
>>>>
>>>> ConcurrencyTools.shutdown();
>>>>
>>>>
>>>> [1] http://biojava.org/wiki/CVS_to_SVN_Migration
>>>> [2] http://biojava.org/download/maven/
>>>> [3]
>>>> http://biojava.org/download/maven/org/biojava/biojava3-core/3.0-SNAPSHOT/biojava3-core-3.0-SNAPSHOT.jar
>>>> [4]
>>>> http://biojava.org/download/maven/org/biojava/biojava3-alignment/3.0-SNAPSHOT/biojava3-alignment-3.0-SNAPSHOT.jar
>>>>
>>>>
>>>> Enjoy,
>>>> Mark
>>>>
>>>>
>>>>
>>>> On 6/30/2010 7:05 AM, Andy Yates wrote:
>>>>
>>>>> It was more of a way of decomposing the operations into a data structure
>>>>> where each element in the 1st dimension represents the elements to compare
>>>>> together. Really the Perl code is a way of describing the operations to
>>>>> occur in order to cover all possible permutations.
>>>>>
>>>>> Andy
>>>>>
>>>>> On 30 Jun 2010, at 13:01, Radhouane Aniba wrote:
>>>>>
>>>>> Hi Andy,
>>>>>>
>>>>>> Thank you for your reply.
>>>>>> Actually, I was thinking about a parallelization method or a kind of
>>>>>> hadoop like implementation to do all pairwise comparison. The aim is that at
>>>>>> the end i would like to calculate the average pairwise similarity score
>>>>>> within a set of sequences.
>>>>>>
>>>>>> What I am doing is something like that :
>>>>>>
>>>>>> For I = 0 to I = Length(ARRAY_OF_SEQUENCES)-1
>>>>>>   For J=I+1 to J=Length(ARRAY_OF_SEQUENCES)
>>>>>>      PairwiseScore
>>>>>> +=CALCULATE_PAIRWISE(ARRAY_OF_SEQUENCES[I],ARRAY_OF_SEQUENCES[J])
>>>>>> End_For
>>>>>> End_For
>>>>>>
>>>>>> Average_Score = PairwiseScore/length(ARRAY_OF_SEQUENCES)
>>>>>>
>>>>>> In fact the problem is in the ((n * (n-1)) / 2) operations.
>>>>>>
>>>>>> As for the solution presented in perl sorry but I dont see what you've
>>>>>> did inside ?! You created a 2D array ? how to achieve operations inside , I
>>>>>> think this do not resolve the ((n * (n-1)) / 2)  problem ? Isn't it ?
>>>>>>
>>>>>> Radwen
>>>>>>
>>>>>>
>>>>>> 2010/6/30 Andy Yates<ayates at ebi.ac.uk>
>>>>>> Hi Radwen,
>>>>>>
>>>>>> I would have said that this is more of a problem because of the type of
>>>>>> algorithm you are using. It's impossible (as far as I am aware) to calculate
>>>>>> the score matrices in one step for multiple sequences&   even if it did I
>>>>>> don't quite see where the speed increase would come from.
>>>>>>
>>>>>> As for the All vs. All problem don't forget that really your total number
>>>>>> of comparisons is  ((n * (n-1)) / 2) where n is the number of sequences you
>>>>>> are comparing so a simple 2D for loop will have you spending twice the
>>>>>> amount of time on this than needs to occur. When I've done this before (in
>>>>>> Perl so excuse the usage of it) the code looks like this:
>>>>>>
>>>>>> my @output;
>>>>>> my @elements = ('some','elements','something');
>>>>>> while(scalar(@elements)>   1) {
>>>>>> my $target = pop(@elements);
>>>>>> foreach my $remaining_element (@elements) {
>>>>>>    push(@output, [$target, $remaining_element]);
>>>>>> }
>>>>>> }
>>>>>>
>>>>>> So this would have emitted:
>>>>>>
>>>>>> [
>>>>>>        ['some','elements'],
>>>>>>        ['some','something'],
>>>>>>        ['elements','something']
>>>>>> ]
>>>>>>
>>>>>> Try doing something similar to this using the Java Deque objects which
>>>>>> can act as a stack.
>>>>>>
>>>>>> Hope this helps to answer your question
>>>>>>
>>>>>> Andy
>>>>>>
>>>>>> On 30 Jun 2010, at 12:18, Radhouane Aniba wrote:
>>>>>>
>>>>>> Hello Biojava people,
>>>>>>>
>>>>>>> I have a question concerning Needlman Wunsh or Smith waterman
>>>>>>> algorithms.
>>>>>>> I am using Biojava 1.7 and I read sequences from proteins fasta file
>>>>>>> then I
>>>>>>> store my sequences into an array to calculate pairwise similarity scores
>>>>>>> using a for loop.
>>>>>>> The problem is that it is very time consuming if we want to calculate
>>>>>>> all
>>>>>>> pairwise for a big number of protein sequences. I would like to know if
>>>>>>> there is way to do a kind of "All against All" comparisons in one single
>>>>>>> step ?
>>>>>>> Do someone have a solution for this kind of problem ?
>>>>>>>
>>>>>>> Thanks for help.
>>>>>>>
>>>>>>> Radwen
>>>>>>> _______________________________________________
>>>>>>> biojava-dev mailing list
>>>>>>> biojava-dev at lists.open-bio.org
>>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-dev
>>>>>>>
>>>>>>
>>>
>>>
>>> --
>>> R. ANIBA
>>>
>>> Bioinformatics PhD
>>> Laboratoire de Bioinformatique et Génomique Intégrative,
>>> Institut de Génétique et de Biologie Moléculaire et Cellulaire (IGBMC),
>>> 1 rue Laurent Fries,
>>> 67404 Illkirch, France.
>>> http://www-igbmc.u-strasbg.fr
>>> http://alnitak.u-strasbg.fr/~aniba/alexsys
>>>
>>> _______________________________________________
>>> biojava-dev mailing list
>>> biojava-dev at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/biojava-dev
>>
>> --
>> Andrew Yates                   Ensembl Genomes Engineer
>> EMBL-EBI                       Tel: +44-(0)1223-492538
>> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
>> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>>
>>
>>
>>
>>
>> _______________________________________________
>> biojava-dev mailing list
>> biojava-dev at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-dev
>>
>>
>>
>>
>> --
>> R. ANIBA
>>
>> Bioinformatics PhD
>> Laboratoire de Bioinformatique et Génomique Intégrative,
>> Institut de Génétique et de Biologie Moléculaire et Cellulaire (IGBMC),
>> 1 rue Laurent Fries,
>> 67404 Illkirch, France.
>> http://www-igbmc.u-strasbg.fr
>> http://alnitak.u-strasbg.fr/~aniba/alexsys
>