[Bioperl-l] t/SimpleAlign: not ok 18

Thu, 12 Sep 2002 23:35:06 -0400

On Sep 12,  8:05am, Jason Stajich wrote:
> I cannot replicate on either the released tarball or current 1-0-0 branch
> on IRIX with perl 5.6.1.  Very strange.  Can it be a 5.8.0 bug?  That
> seems odd but possible.

Well, I just took a look at SimpleAlign's consensus procedure, and I can see 
why there's a difference - and it is a _bioperl_ bug, not a perl bug. Perl
5.8.0 uses a different hash algorithm, resulting in having a different
ordering of letters with "each". The alignment in question has equal numbers 
of 'D's and 'E's at the third position. Previously, the ordering of the hash 
resulted in 'D' coming first; it now results in 'E' coming first. I suggest 

sub _consensus_aa {
    my $self = shift;
    my $point = shift;
    my $threshold_percent = shift || -1 ;
    my ($seq,%hash,$count,$letter,$key);

    foreach $seq ( $self->each_seq() ) {
        $letter = substr($seq->seq,$point,1);
        $self->throw("--$point-----------") if $letter eq '';
        ($letter =~ /\./) && next;
        # print "Looking at $letter\n";
        $hash{$letter}++;
    }
    my $number_of_sequences = $self->no_sequences();
    my $threshold = $number_of_sequences * $threshold_percent / 100. ;
    $count = -1;
    $letter = '?';

    foreach $key ( keys %hash ) {
        # print "Now at $key $hash{$key}\n";
        if( $hash{$key} > $count && $hash{$key} >= $threshold) {
            $letter = $key;
            $count = $hash{$key};
        }
    }
    return $letter;
}

be replaced with

sub _consensus_aa {
    my $self = shift;
    my $point = shift;
    my $threshold_percent = shift || -1 ;
    my ($seq,%hash,$count,$letter,$key);

    foreach $seq ( $self->each_seq() ) {
        $letter = substr($seq->seq,$point,1);
        $self->throw("--$point-----------") if $letter eq '';
        ($letter =~ /\./) && next;
        # print "Looking at $letter\n";
        $hash{$letter}++;
    }
    my $number_of_sequences = $self->no_sequences();
    my $threshold = $number_of_sequences * $threshold_percent / 100. ;
    $count = -1;
    $letter = '?';

    foreach $key ( sort(keys %hash) ) {
        # print "Now at $key $hash{$key}\n";
        if( $hash{$key} > $count && $hash{$key} >= $threshold) {
            $letter = $key;
            $count = $hash{$key};
        }
    }
    return $letter;
}

And any tests that differ as a result being edited in their expected
answer. The 'sort' in the above will result in the consensus sequence not
being affected by changes in the hash algorithm.

This is, however, not what I would describe as an ideal fix. I suggest that
taking into account what the other residues are (if doing a protein
consensus) and which one of the two (or more) tied residues they are most
similar to would be preferable (using the CONSERVATION_GROUPS rules for
which is most similar, probably, although allowing user modification of this 
is desirable).

	-Allen

-- 
Allen Smith			http://cesario.rutgers.edu/easmith/
September 11, 2001		A Day That Shall Live In Infamy II
"They that can give up essential liberty to obtain a little temporary
safety deserve neither liberty nor safety." - Benjamin Franklin