Bioperl: "lightweight analyses"
Andrew Dalke
dalke@bioreason.com
Sun, 07 Mar 1999 18:54:58 -0800
This is a multi-part message in MIME format.
--------------575712A676C9739C0C568F4A
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
I was looking at:
http://bio.perl.org/Projects/Sequence/
where it says:
> Desirable lightweight analyses not yet written
>
> Property analysis (MW, pI, etc.)
> Metric "sliding window" properties (GC-content, hydrophobicity, etc.)
For some strange reason, I was in the mood to write some perl today,
so I implemented at least the algorithms for doing these. I don't
do much with the bioperl code itself, so someone else will have to
look into integrating this into the existing framework. Also, as
I don't do much perl programming these days, that person should also
check to make sure I did things the modern way.
I've attached the code to this email. It's only a few routines
along with some test scripts and examples. With it, the calculation
of the approx. molecular weight is:
$w = &sum(&map_properties($seq, \%weights));
(of course, you need a list of molecular weights to do this mapping).
The code is similar for mapping any sort of property onto a residue;
though I make the assumption that all inputs will be a string of
residues in their single character abbreviation.
I also have a sliding window code. It is of the form:
window($size, $ref_to_list_of_values [, $ref_to_function])
where the ref_to_function is a reference to a function which takes
a list and returns the computed value for that window. If it
isn't given, the default is to compute an average.
Tieing them together gives a quick way to compute the windowed
average of hydrophobicity values:
@hydrophobicity = &map_properties($protein, \%scale_B);
@avg_hydro = &window(5, \@hydrophobicity);
and if you want to do a weighted triangle average over 3 terms you
could do:
sub tri3 {
return ($_[0] + 2*$_[1] + $_[2]) / 4;
}
@avg_hydro = &window(1, \@hydrophobicity, \&tri3);
I *think* GC content would be:
GC = ("G" => 1, "C" => 1, "A" => 0, "T" => 0);
@gc_content = &map_properties($nt, \%GC);
$total_gc_content = &find_average(@gc_content);
$windowed_gc_content = &window(3, @gc_content);
I say *think* because I'm more of a protein person and I'm not sure
what "GC content" means; I assume it's the number of G and C bases
over the total number of bases in the region considered. I also
say it because I haven't tested this example code.
Legal notice: this code may be distributed under whatever the
standard bioperl license is. If there isn't a standard license, it
may be be distributed under terms equal to the Python license (with
the appropriate replacement of names to Andrew Dalke and Bioreason,
Inc.).
Andrew
dalke@bioreason.com
--------------575712A676C9739C0C568F4A
Content-Type: application/x-perl; name="seqmap.pl"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline; filename="seqmap.pl"
#!/usr/local/bin/perl -w
sub map_properties {
my($seq, $props) = @_;
my(%props) = %$props; # turn ref to map into map
@seq = split('', $seq); # turn string into list of characters
return map($props{$_}, @seq); # map the properties onto the residues
}
# Simple sum followed by divide.
# Hopefully you won't go past maxint :)
sub find_average {
my(@l) = @_;
my($sum) = 0.0;
foreach (@l) {
$sum += $_;
}
return $sum / ($#l+1);
}
# This *has* to be an existing perl function, but I don't
# know what it is.
sub sum {
my($sum);
foreach (@_) {
$sum += $_;
}
return $sum;
}
# Go over a list (reference with a sliding window. For each window
# position, compute some value (if the function to compute the value
# is not given, compute the average).
# Return the list of computed window values. This list will be
# 2*$size shorter than the original list because of edge effects.
# The effective window size is 2*$size+1, so
# if $size == 0, $window_size == 1
# if $size == 2, $window_size == 5,
# etc.
sub window {
my($size, $list, $func) = @_;
my($window_size) = 2*$size+1;
my(@list) = @$list;
my($i, @ret);
# If there is a user defined function
if (defined $func) {
my($end) = $#list - $window_size + 1;
# pass window sized slices to the function for farther evaluation
for ($i=0; $i<=$end; $i++) {
# and save the results
push(@ret, &$func(@list[$i, $i+$window_size-1]))
}
} else {
# compute the average directly
# accumulates the sum instead of recomputing all window sums
my($sum) = 0.0;
# Get the initial window sum (sans last term)
for ($i=0; $i<$window_size-1; $i++) {
$sum += $list[$i];
}
# go over the rest of the list, adding and subtracting as needed
for (; $i<=$#list; $i++) {
$sum += $list[$i];
push(@ret, $sum / $window_size);
$sum -= $list[$i-$window_size];
}
}
return @ret;
}
## Some tests of the above
%test_hash = (
'A' => 0,
'T' => 1,
'C' => 3,
'G' => 8,
);
$seq = 'AATCGAGCACAACTATGCGCTACATGATGACTGCGAACT';
@p = map_properties($seq, \%test_hash);
for ($i=0; $i<length($seq); $i++) {
if ($p[$i] != $test_hash{substr($seq,$i, 1)}) {
print "Problem at $i\n";
}
}
# check to see that the length is the summed weight
%unit_weight = (
'A' => 1,
'T' => 1,
'C' => 1,
'G' => 1,
);
$w = &sum(&map_properties($seq, \%unit_weight));
if (length($seq) != $w) {
print "Weights are different: got $w instead of ";
print length($seq);
print "\n";
}
# List and the average of the list
@test_set = ([[1, 1, 1, 1, 1], 1],
[[0, 1, 2, 3, 4], 2],
[[0], 0],
[[-3], -3],
[[1, 2], 1.5],
);
foreach $test (@test_set) {
$l = (@$test)[0];
$v = (@$test)[1];
$x = &find_average(@$l);
if ($x != $v) {
print "Not the same for @$l: $x != $v\n";
}
}
# A window size of 1 with averaging shouldn't change the list
foreach $test (@test_set) {
$l = (@$test)[0];
$v = (@$test)[1];
@x = &window(0, $l);
if (@x != @$l) {
print "Not the same: @$l => @x\n";
}
}
# Now use a function to do the averaging
foreach $test (@test_set) {
$l = (@$test)[0];
$v = (@$test)[1];
@x = &window(0, $l, \&find_average);
if (@x != @$l) {
print "Not the same with average function: @$l => @x\n";
}
}
@test_set2 = ([[1, 1, 1], [1]],
[[1, 1, 1, 1], [1, 1]],
[[1, 1, 1, 1, 1], [1, 1, 1]],
[[1, 2, 3, 4, 5], [2, 3, 4]],
[[1, 2, 4, 8], [7/3, 14/3]],
);
foreach $test (@test_set2) {
$l = (@$test)[0];
$v = (@$test)[1];
@x = &window(1, $l);
if (@x != @$v) {
print "why are these different? @$v and @x\n";
}
}
# And now, some examples
# Don't have a good list handy, so grabbed the data from:
# http://selene.biochem.uga.edu/tutorial/aaprops.html
# Of course, a real program should deal with the ends properly, but
# I don't have that data (so don't trust the result to 5 sig figs!)
# This table is in Daltons.
%weights = (
'A' => 71.09,
'C' => 103.15,
'D' => 114.08,
'E' => 128.11,
'F' => 147.19,
'G' => 57.07,
'H' => 138.16,
'I' => 113.18,
'K' => 129.19,
'L' => 113.18,
'M' => 131.21,
'N' => 114.1,
'P' => 97.13,
'Q' => 128.14,
'R' => 157.2,
'S' => 87.09,
'T' => 101.12,
'V' => 99.15,
'W' => 186.23,
'Y' => 163.12,
);
# The following comes from Branden and Tooze, p210 who go to reference
# K.Kyte and R.F.Doolittle
%scale_A = (
'F' => 2.8,
'M' => 1.9,
'I' => 4.5,
'L' => 3.8,
'V' => 4.2,
'C' => 2.5,
'W' => -0.9,
'A' => 1.8,
'T' => -0.7,
'G' => -0.4,
'S' => -0.8,
'P' => -1.6,
'Y' => -1.3,
'H' => -3.2,
'Q' => -3.5,
'N' => -3.5,
'E' => -3.5,
'K' => -3.9,
'D' => -3.5,
'R' => -4.5,
);
# D.A.Engelman, T.A.Steitz, and A. Goldman
%scale_B = (
'F' => 3.7,
'M' => 3.4,
'I' => 3.1,
'L' => 2.8,
'V' => 2.6,
'C' => 2.0,
'W' => 1.9,
'A' => 1.6,
'T' => 1.2,
'G' => 1.0,
'S' => 0.6,
'P' => -0.2,
'Y' => -0.7,
'H' => -3.0,
'Q' => -4.1,
'N' => -4.8,
'E' => -8.2,
'K' => -8.8,
'D' => -9.2,
'R' => -12.3,
);
sub pretty_plot {
my($min, $max, $width, @vals) = @_;
my($pos);
my($scale) = $width / ($max - $min);
print " " . "-" x $width . "\n";
my($i) = 0;
foreach (@vals) {
$pos = int(($_ - $min) * $scale);
print sprintf("%5d ", $i) . ' ' x ($pos-1) . '*' . "\n";
$i++;
}
print " " . "-" x $width . "\n";
}
# Picked
# gi|2624664|pdb|1AIJ|M Chain M, Photosynthetic Reaction Center From
# Rhodobacter Sphaeroides In The Charge-Neutral Dqaqb State
$protein = <<EOM;
AEYQNIFSQVQVRGPADLGMTEDVNLANRSGVGPFSTLLGWFGNAQLGPIYLGSLGVLSLFSGLMWFFTI
GIWFWYQAGWNPAVFLRDLFFFSLEPPAPEYGLSFAAPLKEGGLWLIASFFMFVAVWSWWGRTYLRAQAL
GMGKHTAWAFLSAIWLWMVLGFIRPILMGSWSEAVPYGIFSHLDWTNNFSLVHGNLFYNPFHGLSIAFLY
GSALLFAMHGATILAVSRFGGERELEQIADRGTAAERAALFWRWTMGFNATMEGIHRWAIWMAVLVTLTG
GIGILLSGTVVDNWYVWGQNHGMAPLN
EOM
$protein =~ s/[^A-Z]//g;
@hydrophobicity = &map_properties($protein, \%scale_B);
print "Raw values\n";
&pretty_plot(-12.5, 4, 70, @hydrophobicity);
print "\n\nSmoothed values\n";
@avg_hydro = &window(5, \@hydrophobicity);
&pretty_plot(-12.5, 4, 70, @avg_hydro);
$w = &sum(&map_properties($seq, \%weights));
print "The molecular weight of that protein is: $w\n";
--------------575712A676C9739C0C568F4A--
=========== Bioperl Project Mailing List Message Footer =======
Project URL: http://bio.perl.org/
For info about how to (un)subscribe, where messages are archived, etc:
http://www.techfak.uni-bielefeld.de/bcd/Perl/Bio/vsns-bcd-perl.html
====================================================================