[Bioperl-l] t/SimpleAlign: not ok 18

17 Sep 2002 13:37:32 +0100

Allen,

I fixed the function according to your suggestion. I also added you as a
contributor to the module.

	-Heikki

On Fri, 2002-09-13 at 04:35, Allen Smith wrote:
> On Sep 12,  8:05am, Jason Stajich wrote:
> > I cannot replicate on either the released tarball or current 1-0-0 branch
> > on IRIX with perl 5.6.1.  Very strange.  Can it be a 5.8.0 bug?  That
> > seems odd but possible.
> 
> Well, I just took a look at SimpleAlign's consensus procedure, and I can see 
> why there's a difference - and it is a _bioperl_ bug, not a perl bug. Perl
> 5.8.0 uses a different hash algorithm, resulting in having a different
> ordering of letters with "each". The alignment in question has equal numbers 
> of 'D's and 'E's at the third position. Previously, the ordering of the hash 
> resulted in 'D' coming first; it now results in 'E' coming first. I suggest 
> 
> sub _consensus_aa {
>     my $self = shift;
>     my $point = shift;
>     my $threshold_percent = shift || -1 ;
>     my ($seq,%hash,$count,$letter,$key);
> 
>     foreach $seq ( $self->each_seq() ) {
>         $letter = substr($seq->seq,$point,1);
>         $self->throw("--$point-----------") if $letter eq '';
>         ($letter =~ /\./) && next;
>         # print "Looking at $letter\n";
>         $hash{$letter}++;
>     }
>     my $number_of_sequences = $self->no_sequences();
>     my $threshold = $number_of_sequences * $threshold_percent / 100. ;
>     $count = -1;
>     $letter = '?';
> 
>     foreach $key ( keys %hash ) {
>         # print "Now at $key $hash{$key}\n";
>         if( $hash{$key} > $count && $hash{$key} >= $threshold) {
>             $letter = $key;
>             $count = $hash{$key};
>         }
>     }
>     return $letter;
> }
> 
> be replaced with
> 
> sub _consensus_aa {
>     my $self = shift;
>     my $point = shift;
>     my $threshold_percent = shift || -1 ;
>     my ($seq,%hash,$count,$letter,$key);
> 
>     foreach $seq ( $self->each_seq() ) {
>         $letter = substr($seq->seq,$point,1);
>         $self->throw("--$point-----------") if $letter eq '';
>         ($letter =~ /\./) && next;
>         # print "Looking at $letter\n";
>         $hash{$letter}++;
>     }
>     my $number_of_sequences = $self->no_sequences();
>     my $threshold = $number_of_sequences * $threshold_percent / 100. ;
>     $count = -1;
>     $letter = '?';
> 
>     foreach $key ( sort(keys %hash) ) {
>         # print "Now at $key $hash{$key}\n";
>         if( $hash{$key} > $count && $hash{$key} >= $threshold) {
>             $letter = $key;
>             $count = $hash{$key};
>         }
>     }
>     return $letter;
> }
> 
> Any tests whose results differ after this change should have their expected
> answers edited. The 'sort' above makes the consensus sequence independent
> of changes in the hash algorithm.
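
A minimal standalone sketch of the tie behaviour described above. It assumes
a hypothetical consensus_letter helper that takes one alignment column as a
list of characters; it is not part of Bio::SimpleAlign, and only mirrors the
counting and threshold logic of the quoted _consensus_aa.

```perl
use strict;
use warnings;

# Hypothetical standalone helper (not part of Bio::SimpleAlign): picks the
# consensus letter for one alignment column, given as an array reference.
sub consensus_letter {
    my ($column, $threshold_percent) = @_;
    $threshold_percent = -1 unless $threshold_percent;

    my %count;
    for my $letter (@$column) {
        next if $letter =~ /\./;    # skip gap characters, as in the original
        $count{$letter}++;
    }

    my $threshold = @$column * $threshold_percent / 100;
    my ($best, $best_count) = ('?', -1);

    # Sorting the keys fixes the visiting order, so with a strict '>'
    # comparison the alphabetically first letter wins any tie.
    for my $key (sort keys %count) {
        if ($count{$key} > $best_count && $count{$key} >= $threshold) {
            ($best, $best_count) = ($key, $count{$key});
        }
    }
    return $best;
}

# Equal numbers of 'D' and 'E', as at the third position discussed above.
print consensus_letter([qw(E D E D)]), "\n";    # prints "D"
print consensus_letter([qw(D E D E)]), "\n";    # prints "D"
```

With the sort in place the result no longer depends on the order in which
the hash returns its keys, whichever perl version is used.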
> 
> This is, however, not what I would describe as an ideal fix. For a protein
> consensus, it would be preferable to take the other residues into account
> and pick whichever of the two (or more) tied residues they are most similar
> to (probably using the CONSERVATION_GROUPS rules to decide which is most
> similar, although allowing user modification of this is desirable).
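
One way the similarity-based tie-break suggested above might look, as a rough
sketch only. The break_tie helper and the two groups in @GROUPS are
assumptions made for illustration; they are not taken from Bio::SimpleAlign
or from its CONSERVATION_GROUPS.

```perl
use strict;
use warnings;

# Illustrative similarity groups only; an assumption for this sketch, not
# copied from Bio::SimpleAlign's CONSERVATION_GROUPS.
my @GROUPS = qw(NDEQ QHRK);

# Hypothetical helper: among tied candidate residues, prefer the one that
# shares a group with the most other residues in the column.
sub break_tie {
    my ($column, @tied) = @_;
    my ($best, $best_score) = (undef, -1);

    for my $cand (sort @tied) {    # sorted, so any remaining tie is stable
        my $score = 0;
        for my $res (@$column) {
            next if $res eq $cand;
            # Count this residue once if any group contains both letters.
            $score++
                if grep { index($_, $cand) >= 0 && index($_, $res) >= 0 } @GROUPS;
        }
        ($best, $best_score) = ($cand, $score) if $score > $best_score;
    }
    return $best;
}

# 'D' and 'K' are tied at two each; the remaining 'R' groups with 'K' here.
print break_tie([qw(D D K K R)], 'D', 'K'), "\n";    # prints "K"
```

When the other residues favour neither candidate, the sorted order still
decides, so the result stays deterministic.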
> 
> 	-Allen
> 
> -- 
> Allen Smith			http://cesario.rutgers.edu/easmith/
> September 11, 2001		A Day That Shall Live In Infamy II
> "They that can give up essential liberty to obtain a little temporary
> safety deserve neither liberty nor safety." - Benjamin Franklin
-- 
______ _/      _/_____________________________________________________
      _/      _/                      http://www.ebi.ac.uk/mutations/
     _/  _/  _/  Heikki Lehvaslaiho          heikki@ebi.ac.uk
    _/_/_/_/_/  EMBL Outstation, European Bioinformatics Institute
   _/  _/  _/  Wellcome Trust Genome Campus, Hinxton
  _/  _/  _/  Cambs. CB10 1SD, United Kingdom
     _/      Phone: +44 (0)1223 494 644   FAX: +44 (0)1223 494 468
___ _/_/_/_/_/________________________________________________________