[Bioperl-l] counting gaps in a column of alignment

Fri Jul 26 04:04:21 UTC 2013

This is pretty inefficient - the multiple calls to slice - if you don't need the random access aspect of slice, but instead are going to process every column, I would just operate on the strings themselves.

Look at how gap_col_matrix function is implemented  - it takes all the strings from the seqs and does the calculation in one pass.

you just want to walk through each sequence and then each position in each sequence and update an array which has a hash in each slot. The hash contains the nucleotide counts e.g.

 $counts[0]->{A}  would tell you the number of A's in Column 0.

you would build up this datastructure with 

for my $seq ( $aln->each_seq ) {
 my @seqarray = split(//,$seq->seq);
  for my $col ( 0..$aln->length ) {
   my $base = $seqarray[$col]
    $counts[$col]->{$base}++;
  }
}

Or you can use substr instead of split to operate on the string itsself.

On Jul 24, 2013, at 9:37 PM, subha kalyanamoorthy <sksweety24 at gmail.com> wrote:

> Hi there,
> 
> I am a new bioperl user.  I made a script to count the nucleotide
> composition in each column of the alignment. I am able to get the count for
> the nucleotides, but not for the gap characters from the following program.
> I would greatly appreciate any suggestions to correct this program.
> 
> Thanks.
> 
> #***************************My Program**********************************
> 
> #!/bin/perl -w
> use strict;
> use warnings;
> 
> use List::Util 'max';
> use Bio::SimpleAlign;
> use Bio::Align::AlignI;
> use Bio::AlignIO;
> use Bio::SeqIO;
> 
> my $in= Bio::AlignIO->new( -file => "seq.fst", -format => "fasta");
> 
> my $align = $in->next_aln();
> 
> print "column\tA's\tT's\C's\G's\n";
> 
> for (my $i = 1; $i <= $align->length; $i++) {
> 
> my %count;
> 
> my $seqs = $align->slice($i,$i);
> 
> my $gap_char = $seqs->gap_char();
> 
>    my $count_A=0;
>    my $count_C=0;
>    my $count_T=0;
>    my $count_G=0;
>    my $count_N=0;
>    my $count_gap=0;
> 
>    foreach my $seq ($seqs->each_seq) {
> 
>    my $col=$seq->seq;
> 
>        if ($col eq 'A'){
>        $count_A++;
>        }elsif ($col eq 'C'){
>        $count_C++;
>        }elsif ($col eq 'T'){
>        $count_T++;
>        }elsif ($col eq 'G'){
>        $count_G++;
>        }elsif ( $col eq 'N'){
>        $count_N++;
>        }elsif ($col =~ m/^\Q$gap_char\E$/){
>        $count_gap++;
> }
>      $count{$seq->seq} += 1;
> }
> 
> print"$i\t$count_A\t$count_T\t$count_C\t$count_G\n";
> 
> }
> 
> #***********************************************************************`
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

Jason Stajich
jason.stajich at gmail.com
jason at bioperl.org