Bioperl: relative-majority consensus, fast code sought
Georg Fuellen
fuellen@alum.mit.edu
Mon, 1 Mar 1999 21:36:30 +0100 (MET)
Many thanks to Andrew Dalke for mailing me a suggestion to
accelerate the sort-- he has also pointed me to two typos
which are now corrected in the example appended.
Any further hints would be greatly appreciated :)
Andrew Dalke wrote,
> Finally, why do the sorts (n*log(n)) operations when you really
> want a search (order(n))?
>
> @chars = split(//, "AGCATCGATGATAGTGCTATGAATCGTATGATATGCCAA");
> $threshold = 0.15;
> $threshold *= ($#chars+1);
>
> %temp = (); # count the occurance of each character
> grep ++$temp{$_}, @chars;
>
> $best_key = undef; # flag to tell if something met the threshold
> $best_val = $threshold; # known minimum value
> while ( ($k, $v) = each %temp) { # linear search to find the max
> if ($v>$best_val ||
> ($v==$best_val && (! $best_key || $best_key lt $k))) {
> $best_key = $k;
> $best_val = $v;
> }
> }
> if ($best_key) {
> print "$best_key ($best_val)\n";
> } else {
> print "!\n";
> }
Thanks-- most of the @char vectors are just 4-10 characters long,
but since the subroutine is called many many times, it's probably
a big improvement (it seems to be about 50% on my small test data set)
> Andrew
> dalke@bioreason.com
>
Corrected code (hopefully;-):
>>>>>>>
Goal: To compute the consensus of the letters in @chars,
w/ the option to request the relative-majority winner,
and break ties according to alphabetical order.
$threshold = 0; # $threshold==0 (or 0.249) implies relative majority
# $threshold==0.33 implies relative majority > one-third
# $threshold==0.5 implies absolute majority,
# $threshold==0.66 a two-thirds majority
$threshold *= ($#chars+1); #eg if there are 50 chars, $threshold==0.5,
#25 is the lower bound for absolute majority
%temp = ();
@list = sort { $temp{$b}<=>$temp{$a} } grep ++$temp{$_} > $threshold, @chars;
#@list is ordered by number of occurances, only chars observed enough times
@list2 = sort {$a cmp $b} grep { $temp{$_} == $temp{$list[0]} } @list;
#@list2 is ordered lexicographically, only chars observed most often
$consensus = (defined($list2[0]) ? $list2[0] : "!");
#"!" -> no consensus
<<<<<<<
best wishes,
georg
=========== Bioperl Project Mailing List Message Footer =======
Project URL: http://bio.perl.org/
For info about how to (un)subscribe, where messages are archived, etc:
http://www.techfak.uni-bielefeld.de/bcd/Perl/Bio/vsns-bcd-perl.html
====================================================================