[Bioperl-l] counting gaps in sequence data

Thu Oct 14 19:32:20 EDT 2004

Mike-

Something like this maybe?

use strict;
use warnings;

my %seqs = (human => "acgtt---cgatacg---acgact-----t",
            chimp => "acgtacgatac---actgca---ac",
            mouse => "acgata---acgatcg----acgt");

for my $seq (keys %seqs) { # An array of your sequences
  print "\n\nThe $seq sequence has the following gaps:\n";
  my %gaptype;
  for my $gap (1..5) { # 5 or however large you want gaps to be counted
    while ($seqs{$seq} =~ /[atgc]-{$gap}[atgc]/g) { #notice that this 
won't catch terminal gaps
      #This creates a hash of arrays.  The arrays hold the locations of the
      #gaps, and the count of each gaptype is determined by the length 
of that array.
      push @{$gaptype{$gap}}, $-[0] + 2;
    }
    if (defined @{$gaptype{$gap}}) {
      my $positions = join ", ", @{$gaptype{$gap}};
      print "\tGap length $gap begining at positions:\t$positions\n";
    }
  }
}

Barry Moore

Michael Robeson wrote:

> I have a set of data that looks something like the following:
>
> >human
> acgtt---cgatacg---acgact-----t
> >chimp
> acgtacgatac---actgca---ac
> >mouse
> acgata---acgatcg----acgt
>
> I am having trouble setting up a hash etc., to count the number and 
> types of continuous gaps. For example the 'human' sequence above has 2 
> sets of 3 gaps and 1 set of 5 gaps. The 'chimp' has 2 sets of 3 gaps 
> and finally the 'mouse' has 1 set of 3 gaps and 1 set of 4 gaps.
>
> So, I am having trouble being able to assign a dynamic variable (i.e. 
> gap length) and place that in a pattern match so that it can count how 
> many gaps of that length are in that particular sequence. I know how 
> to set up a hash to count the number of times a gap appears: 
> '$gaptype{$gap}++' or something. The problem is: what is the best way 
> (and how) can I set '$gap' to be dynamic.
>
> I need to know the length of each consecutive string of gaps. I know 
> how to count the gaps by using the 'tr' function. But it gets 
> confusing when I need to add counts to every instance of that gap 
> length. I also need to know the position of each gap (denoted by the 
> position of the first gap in that particular instance). I know that I 
> can use the 'pos()' command for this.
>
> So, my problem is that I think I know some of the bits of code to put 
> into place the problem is I am getting lost on how to structure it all 
> together. For now I am just trying to get my output to look like this:
>
> Human
> number of 3 base pair gaps:        2
>             at positions:        6, 16
> number of 5 base pair gaps:        1
>             at positions:        25
>
> Chimp
> .... and so on ...
>
> So, any suggestions would be greatly appreciated. If anyone can help 
> me out with all or even just bits of this I would greatly appreciate 
> it. This should help me get started on some more advanced parsing I 
> need to do after this. I like to try and figure things out on my own 
> if I can, so even pseudo code would be of great help!
>
> -Thanks
> -Mike
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at portal.open-bio.org
> http://portal.open-bio.org/mailman/listinfo/bioperl-l

-- 
Barry Moore
Dept. of Human Genetics
University of Utah
Salt Lake City, UT