[Bioperl-l] t/SimpleAlign: not ok 18
Allen Smith
easmith@beatrice.rutgers.edu
Thu, 12 Sep 2002 23:35:06 -0400
On Sep 12, 8:05am, Jason Stajich wrote:
> I cannot replicate on either the released tarball or current 1-0-0 branch
> on IRIX with perl 5.6.1. Very strange. Can it be a 5.8.0 bug? That
> seems odd but possible.
Well, I just took a look at SimpleAlign's consensus procedure, and I can see
why there's a difference - and it is a _bioperl_ bug, not a perl bug. Perl
5.8.0 uses a different hash algorithm, resulting in having a different
ordering of letters with "each". The alignment in question has equal numbers
of 'D's and 'E's at the third position. Previously, the ordering of the hash
resulted in 'D' coming first; it now results in 'E' coming first. I suggest
sub _consensus_aa {
my $self = shift;
my $point = shift;
my $threshold_percent = shift || -1 ;
my ($seq,%hash,$count,$letter,$key);
foreach $seq ( $self->each_seq() ) {
$letter = substr($seq->seq,$point,1);
$self->throw("--$point-----------") if $letter eq '';
($letter =~ /\./) && next;
# print "Looking at $letter\n";
$hash{$letter}++;
}
my $number_of_sequences = $self->no_sequences();
my $threshold = $number_of_sequences * $threshold_percent / 100. ;
$count = -1;
$letter = '?';
foreach $key ( keys %hash ) {
# print "Now at $key $hash{$key}\n";
if( $hash{$key} > $count && $hash{$key} >= $threshold) {
$letter = $key;
$count = $hash{$key};
}
}
return $letter;
}
be replaced with
sub _consensus_aa {
my $self = shift;
my $point = shift;
my $threshold_percent = shift || -1 ;
my ($seq,%hash,$count,$letter,$key);
foreach $seq ( $self->each_seq() ) {
$letter = substr($seq->seq,$point,1);
$self->throw("--$point-----------") if $letter eq '';
($letter =~ /\./) && next;
# print "Looking at $letter\n";
$hash{$letter}++;
}
my $number_of_sequences = $self->no_sequences();
my $threshold = $number_of_sequences * $threshold_percent / 100. ;
$count = -1;
$letter = '?';
foreach $key ( sort(keys %hash) ) {
# print "Now at $key $hash{$key}\n";
if( $hash{$key} > $count && $hash{$key} >= $threshold) {
$letter = $key;
$count = $hash{$key};
}
}
return $letter;
}
And any tests that differ as a result being edited in their expected
answer. The 'sort' in the above will result in the consensus sequence not
being affected by changes in the hash algorithm.
This is, however, not what I would describe as an ideal fix. I suggest that
taking into account what the other residues are (if doing a protein
consensus) and which one of the two (or more) tied residues they are most
similar to would be preferable (using the CONSERVATION_GROUPS rules for
which is most similar, probably, although allowing user modification of this
is desirable).
-Allen
--
Allen Smith http://cesario.rutgers.edu/easmith/
September 11, 2001 A Day That Shall Live In Infamy II
"They that can give up essential liberty to obtain a little temporary
safety deserve neither liberty nor safety." - Benjamin Franklin