[Bioperl-l] counting gaps in sequence data

Thu Oct 14 19:11:26 EDT 2004

I don't know if this is the best way, but it is a way, though it won't 
get you the exact positions.

If your sequence was a continuous string with no new lines or anything,; 
(Can also be an array of sequences, etc, but simplifying it for the 
example)

my $human_sequence = "acgtt---cgatacg---acgact-----t ";;

my (@gaps) = $human_sequence =~ /(\-+)/g;
my %gap_lengths;

foreach my $gap_seq (@gaps)
{
   $gap_lengths{length($gap_seq});
}

while(my ($key, $val) = each(%gap_lengths))
{
   print "$val gaps of length $key\n";
}

As an aside, you can also get the locations now that I think of it. if 
you do

my @match_coords;
while($human_seqence =~ /(\-+)/g)
{
   $gap_lengths{length($gap_seq});

   # You can get the end of the position of the last match with (the 
last dash in the continuous string of dashes)
   # You can subtract the length of the match +1 to get the position of 
the first match also.
   push(@match_coords(pos($human_sequence));
}

Basicaly I think you can see here what you can do. if you want the gap 
positions stored differently you can do that, so you can figure out how 
big the gap is at any position, etc.

Hope this helps

Michael Cipriano

Michael Robeson wrote:

> I have a set of data that looks something like the following:
>
> >human
> acgtt---cgatacg---acgact-----t
> >chimp
> acgtacgatac---actgca---ac
> >mouse
> acgata---acgatcg----acgt
>
> I am having trouble setting up a hash etc., to count the number and 
> types of continuous gaps. For example the 'human' sequence above has 2 
> sets of 3 gaps and 1 set of 5 gaps. The 'chimp' has 2 sets of 3 gaps 
> and finally the 'mouse' has 1 set of 3 gaps and 1 set of 4 gaps.
>
> So, I am having trouble being able to assign a dynamic variable (i.e. 
> gap length) and place that in a pattern match so that it can count how 
> many gaps of that length are in that particular sequence. I know how 
> to set up a hash to count the number of times a gap appears: 
> '$gaptype{$gap}++' or something. The problem is: what is the best way 
> (and how) can I set '$gap' to be dynamic.
>
> I need to know the length of each consecutive string of gaps. I know 
> how to count the gaps by using the 'tr' function. But it gets 
> confusing when I need to add counts to every instance of that gap 
> length. I also need to know the position of each gap (denoted by the 
> position of the first gap in that particular instance). I know that I 
> can use the 'pos()' command for this.
>
> So, my problem is that I think I know some of the bits of code to put 
> into place the problem is I am getting lost on how to structure it all 
> together. For now I am just trying to get my output to look like this:
>
> Human
> number of 3 base pair gaps:        2
>             at positions:        6, 16
> number of 5 base pair gaps:        1
>             at positions:        25
>
> Chimp
> .... and so on ...
>
> So, any suggestions would be greatly appreciated. If anyone can help 
> me out with all or even just bits of this I would greatly appreciate 
> it. This should help me get started on some more advanced parsing I 
> need to do after this. I like to try and figure things out on my own 
> if I can, so even pseudo code would be of great help!
>
> -Thanks
> -Mike
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at portal.open-bio.org
> http://portal.open-bio.org/mailman/listinfo/bioperl-l