[Bioperl-l] counting gaps in sequence data
Barry Moore
barry.moore at genetics.utah.edu
Thu Oct 14 19:32:20 EDT 2004
Mike-
Something like this maybe?
use strict;
use warnings;
my %seqs = (human => "acgtt---cgatacg---acgact-----t",
chimp => "acgtacgatac---actgca---ac",
mouse => "acgata---acgatcg----acgt");
for my $seq (keys %seqs) { # An array of your sequences
print "\n\nThe $seq sequence has the following gaps:\n";
my %gaptype;
for my $gap (1..5) { # 5 or however large you want gaps to be counted
while ($seqs{$seq} =~ /[atgc]-{$gap}[atgc]/g) { #notice that this
won't catch terminal gaps
#This creates a hash of arrays. The arrays hold the locations of the
#gaps, and the count of each gaptype is determined by the length
of that array.
push @{$gaptype{$gap}}, $-[0] + 2;
}
if (defined @{$gaptype{$gap}}) {
my $positions = join ", ", @{$gaptype{$gap}};
print "\tGap length $gap begining at positions:\t$positions\n";
}
}
}
Barry Moore
Michael Robeson wrote:
> I have a set of data that looks something like the following:
>
> >human
> acgtt---cgatacg---acgact-----t
> >chimp
> acgtacgatac---actgca---ac
> >mouse
> acgata---acgatcg----acgt
>
> I am having trouble setting up a hash etc., to count the number and
> types of continuous gaps. For example the 'human' sequence above has 2
> sets of 3 gaps and 1 set of 5 gaps. The 'chimp' has 2 sets of 3 gaps
> and finally the 'mouse' has 1 set of 3 gaps and 1 set of 4 gaps.
>
> So, I am having trouble being able to assign a dynamic variable (i.e.
> gap length) and place that in a pattern match so that it can count how
> many gaps of that length are in that particular sequence. I know how
> to set up a hash to count the number of times a gap appears:
> '$gaptype{$gap}++' or something. The problem is: what is the best way
> (and how) can I set '$gap' to be dynamic.
>
> I need to know the length of each consecutive string of gaps. I know
> how to count the gaps by using the 'tr' function. But it gets
> confusing when I need to add counts to every instance of that gap
> length. I also need to know the position of each gap (denoted by the
> position of the first gap in that particular instance). I know that I
> can use the 'pos()' command for this.
>
> So, my problem is that I think I know some of the bits of code to put
> into place the problem is I am getting lost on how to structure it all
> together. For now I am just trying to get my output to look like this:
>
> Human
> number of 3 base pair gaps: 2
> at positions: 6, 16
> number of 5 base pair gaps: 1
> at positions: 25
>
> Chimp
> .... and so on ...
>
> So, any suggestions would be greatly appreciated. If anyone can help
> me out with all or even just bits of this I would greatly appreciate
> it. This should help me get started on some more advanced parsing I
> need to do after this. I like to try and figure things out on my own
> if I can, so even pseudo code would be of great help!
>
> -Thanks
> -Mike
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at portal.open-bio.org
> http://portal.open-bio.org/mailman/listinfo/bioperl-l
--
Barry Moore
Dept. of Human Genetics
University of Utah
Salt Lake City, UT
More information about the Bioperl-l
mailing list