[Biojava-dev] Pairwise similarity score speed

Thu Jul 1 11:05:34 UTC 2010

Mark !

Can we modify the code you gave to :

- extract the MIN pairwise similarity score
- extract the MAX pairwise similarity score
- calculate the standard deviation of similarity scores

?

Rad

2010/6/30 Mark Chapman <chapman at cs.wisc.edu>

> Hi Radwen,
>
> I have already added this functionality to the BioJava3 alignment package.
>  The code is available on the repository [1] and current builds are on the
> web site [2].  The necessary files are [3] and [4] and in the example code
> that follows you should only have to replace "piwi-seed-fasta.txt" with your
> file name. Also, to switch from Needleman-Wunsch to Smith-Waterman, just
> change PairwiseAligner.GLOBAL to PairwiseAligner.LOCAL .
>
>
> int similars = 0, total = 0;
> GapPenalty gaps = new SimpleGapPenalty();
> SubstitutionMatrix<AminoAcidCompound> blosum62 =
>    new SimpleSubstitutionMatrix<AminoAcidCompound>();
>
> List<ProteinSequence> piwi = new ArrayList<ProteinSequence>();
> try {
>  piwi.addAll(FastaReaderHelper.readFastaProteinSequence(
>      new File("piwi-seed-fasta.txt")).values());
> } catch (Exception e) {
>  e.printStackTrace();
> }
>
> for (SequencePair<ProteinSequence, AminoAcidCompound> pair :
>    Alignments.getAllPairsAlignments(piwi, PairwiseAligner.GLOBAL, gaps,
>        blosum62)) {
>  PairwiseSequenceScorer<ProteinSequence, AminoAcidCompound> scorer =
>      new FractionalSimilarityScorer<ProteinSequence,
> AminoAcidCompound>(pair);
>  System.out.printf("%n%s vs %s : %d / %d%n%s",
> pair.getQuery().getAccession(),
>      pair.getTarget().getAccession(), scorer.getScore(),
> scorer.getMaxScore(),
>      pair);
>  similars += scorer.getScore();
>  total += scorer.getMaxScore();
> }
>
> System.out.printf("%nAverage similarity = %d / %d = %f", similars, total,
>    (double)similars/total);
>
> ConcurrencyTools.shutdown();
>
>
> [1] http://biojava.org/wiki/CVS_to_SVN_Migration
> [2] http://biojava.org/download/maven/
> [3]
> http://biojava.org/download/maven/org/biojava/biojava3-core/3.0-SNAPSHOT/biojava3-core-3.0-SNAPSHOT.jar
> [4]
> http://biojava.org/download/maven/org/biojava/biojava3-alignment/3.0-SNAPSHOT/biojava3-alignment-3.0-SNAPSHOT.jar
>
>
> Enjoy,
> Mark
>
>
>
> On 6/30/2010 7:05 AM, Andy Yates wrote:
>
>> It was more of a way of decomposing the operations into a data structure
>> where each element in the 1st dimension represents the elements to compare
>> together. Really the Perl code is a way of describing the operations to
>> occur in order to cover all possible permutations.
>>
>> Andy
>>
>> On 30 Jun 2010, at 13:01, Radhouane Aniba wrote:
>>
>>  Hi Andy,
>>>
>>> Thank you for your reply.
>>> Actually, I was thinking about a parallelization method or a kind of
>>> hadoop like implementation to do all pairwise comparison. The aim is that at
>>> the end i would like to calculate the average pairwise similarity score
>>> within a set of sequences.
>>>
>>> What I am doing is something like that :
>>>
>>> For I = 0 to I = Length(ARRAY_OF_SEQUENCES)-1
>>>   For J=I+1 to J=Length(ARRAY_OF_SEQUENCES)
>>>      PairwiseScore
>>> +=CALCULATE_PAIRWISE(ARRAY_OF_SEQUENCES[I],ARRAY_OF_SEQUENCES[J])
>>>  End_For
>>> End_For
>>>
>>> Average_Score = PairwiseScore/length(ARRAY_OF_SEQUENCES)
>>>
>>> In fact the problem is in the ((n * (n-1)) / 2) operations.
>>>
>>> As for the solution presented in perl sorry but I dont see what you've
>>> did inside ?! You created a 2D array ? how to achieve operations inside , I
>>> think this do not resolve the ((n * (n-1)) / 2)  problem ? Isn't it ?
>>>
>>> Radwen
>>>
>>>
>>> 2010/6/30 Andy Yates<ayates at ebi.ac.uk>
>>> Hi Radwen,
>>>
>>> I would have said that this is more of a problem because of the type of
>>> algorithm you are using. It's impossible (as far as I am aware) to calculate
>>> the score matrices in one step for multiple sequences&  even if it did I
>>> don't quite see where the speed increase would come from.
>>>
>>> As for the All vs. All problem don't forget that really your total number
>>> of comparisons is  ((n * (n-1)) / 2) where n is the number of sequences you
>>> are comparing so a simple 2D for loop will have you spending twice the
>>> amount of time on this than needs to occur. When I've done this before (in
>>> Perl so excuse the usage of it) the code looks like this:
>>>
>>> my @output;
>>> my @elements = ('some','elements','something');
>>> while(scalar(@elements)>  1) {
>>>  my $target = pop(@elements);
>>>  foreach my $remaining_element (@elements) {
>>>    push(@output, [$target, $remaining_element]);
>>>  }
>>> }
>>>
>>> So this would have emitted:
>>>
>>> [
>>>        ['some','elements'],
>>>        ['some','something'],
>>>        ['elements','something']
>>> ]
>>>
>>> Try doing something similar to this using the Java Deque objects which
>>> can act as a stack.
>>>
>>> Hope this helps to answer your question
>>>
>>> Andy
>>>
>>> On 30 Jun 2010, at 12:18, Radhouane Aniba wrote:
>>>
>>>  Hello Biojava people,
>>>>
>>>> I have a question concerning Needlman Wunsh or Smith waterman
>>>> algorithms.
>>>> I am using Biojava 1.7 and I read sequences from proteins fasta file
>>>> then I
>>>> store my sequences into an array to calculate pairwise similarity scores
>>>> using a for loop.
>>>> The problem is that it is very time consuming if we want to calculate
>>>> all
>>>> pairwise for a big number of protein sequences. I would like to know if
>>>> there is way to do a kind of "All against All" comparisons in one single
>>>> step ?
>>>> Do someone have a solution for this kind of problem ?
>>>>
>>>> Thanks for help.
>>>>
>>>> Radwen
>>>> _______________________________________________
>>>> biojava-dev mailing list
>>>> biojava-dev at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/biojava-dev
>>>>
>>>

-- 
R. ANIBA

Bioinformatics PhD
Laboratoire de Bioinformatique et Génomique Intégrative,
Institut de Génétique et de Biologie Moléculaire et Cellulaire (IGBMC),
1 rue Laurent Fries,
67404 Illkirch, France.
http://www-igbmc.u-strasbg.fr
http://alnitak.u-strasbg.fr/~aniba/alexsys