[Bioperl-l] calculate the frequency of occurrence of the most commonly observed amino acid at each position of multiple sequence alignment

Sat Feb 7 16:56:30 UTC 2009

Dylan- It's worth mentioning that the BioPerl approach carries a lot of overhead; all
the objects make it easy to write just a few lines, but it probably won't be the
fastest way to do what you want. Another path to follow would be:

# Your seqs are plain strings in the array @seqs, aligned and all the same length
my $len = length($seqs[0]);
my @residue_counts;
foreach my $col (0 .. $len - 1) {
  my %h;    # residue => count for this column
  foreach my $seq (@seqs) {
    $h{ substr($seq, $col, 1) }++;
  }
  push @residue_counts, \%h;
}

Now, for each element of @residue_counts (each one is a reference to a hash), look
for the key that has the maximum hash value; that key is the most common residue in
the column, and its value divided by the number of sequences is its frequency. The
snippet above is also worth working through for its educational value, especially
with respect to hashes, which (IMHO) are one of the absolute coolest things about Perl.
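
A minimal sketch of that last step, under some assumptions: the helper name
most_common_residue() and the toy sequences are made up for illustration, gap
characters are counted like any other residue, and ties go to the
alphabetically first residue. It is one way to do it, not the only one.

```perl
use strict;
use warnings;

# Hypothetical helper: takes a reference to a hash of residue => count and
# returns the most common residue, its count, and its frequency in the column.
sub most_common_residue {
    my ($counts) = @_;
    my ($best, $total) = (undef, 0);
    foreach my $res (sort keys %$counts) {    # sorted, so ties resolve alphabetically
        $total += $counts->{$res};
        $best = $res
            if !defined $best || $counts->{$res} > $counts->{$best};
    }
    return unless defined $best;              # empty column: nothing to report
    return ($best, $counts->{$best}, $counts->{$best} / $total);
}

# Toy alignment (made-up data): plain strings, aligned, all the same length
my @seqs = ('MKTA', 'MKSA', 'MRTA', 'LKT-');

# Build one residue => count hash per column, as in the snippet above
my @residue_counts;
foreach my $col (0 .. length($seqs[0]) - 1) {
    my %h;
    $h{ substr($_, $col, 1) }++ foreach @seqs;
    push @residue_counts, \%h;
}

# Report column number (1-based), top residue, its count, and its frequency
foreach my $col (0 .. $#residue_counts) {
    my ($res, $n, $freq) = most_common_residue($residue_counts[$col]);
    printf "%d\t%s\t%d\t%.2f\n", $col + 1, $res, $n, $freq;
}
```

If gaps should not count toward the frequency, delete the '-' key from each
hash before calling the helper.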

cheers- MAJ
  ----- Original Message ----- 
  From: Dylan Krishnan 
  To: Mark A. Jensen 
  Cc: bioperl-l at lists.open-bio.org 
  Sent: Saturday, February 07, 2009 11:43 AM
  Subject: Re: [Bioperl-l] calculate the frequency of occurrence of the most commonly observed amino acid at each position of multiple sequence alignment

  thanks mark!

  the author's other approach is to load the alignment into an MS Excel worksheet and use the "autofilter" procedure to count the occurrences of any residue at a given position of the alignment. the claim is "that excel is useful for this purpose." sounds reasonable for 10 alignments but not 2000!

  again, many thanks.

  -dylan

  On Sat, Feb 7, 2009 at 10:25 AM, Mark A. Jensen <maj at fortinbras.us> wrote:

    Dylan,

    This is an extremely good exercise for anyone learning Perl to do bioinformatics.
    When you have done many exercises like this, you will see what people mean
    when they say it is very straightforward.

    Here are some hints:

    Use the "entropy" scrap at http://www.bioperl.org/wiki/Site_entropy_in_an_alignment .
    You will convert the function entropy_by_column() into the function you need.
    Replace the line

    $ent{$col} = entropy(values %res);

    with a line you will write using the "hash key at max value" scrap, found
    here: http://www.bioperl.org/wiki/Hash_key_at_the_max_value .

    Happy coding!
    Mark

    ----- Original Message ----- From: "Dylan Krishnan" <dylankrishnan at gmail.com>
    To: <bioperl-l at lists.open-bio.org>
    Sent: Saturday, February 07, 2009 11:10 AM
    Subject: [Bioperl-l] calculate the frequency of occurrence of the most commonly observed amino acid at each position of multiple sequence alignment

      I am new to Perl, but this is something I am seeking to do, either through a
      BioPerl module or in plain Perl. Apparently, this is quite "straightforward
      using PERL," but I beg to differ.

      Any assistance regarding this matter would be greatly appreciated.

      Thanks!

      -dylan

      _______________________________________________
      Bioperl-l mailing list
      Bioperl-l at lists.open-bio.org
      http://lists.open-bio.org/mailman/listinfo/bioperl-l