[Bioperl-l] Counting gaps in sequence data revisited.

Sun Oct 17 11:39:49 EDT 2004

I just wanted to thank everyone for their help and suggestions. This is 
the final full working code to count continuos gaps in a every sequence 
in a multi-sequence FASTA file. It may not be elegant but it is fast 
and works well. Sorry for the long post but I wanted to share this with 
those that do any DNA work. :-)

I show in order: the format of the input data, the output of the input 
data, and finally the working script. I have yet to add comments into 
the code but I am sure many of you veterans will figure it out.

-Thanks again all! As always comments, questions or suggestions are 
welcome!
-Mike

-------- INPUT --------
 >dog
atcg--acgat---act-ca----
 >cat
acgt-acgt----acgt-gt-agct-
 >mouse
---acgtacg-atcg---actgac-

------- OUTPUT ---------
***Discovered the following DNA sequences:***
dog found!
cat found!
mouse found!
                 >>>>>> mouse <<<<<<
Indel size:     1       Times found:    2
Positions:
11
25

Indel size:     3       Times found:    2
Positions:
1
16

                 >>>>>> cat <<<<<<
Indel size:     1       Times found:    4
Positions:
5
18
21
26

Indel size:     4       Times found:    1
Positions:
10

                 >>>>>> dog <<<<<<
Indel size:     1       Times found:    1
Positions:
18

Indel size:     2       Times found:    1
Positions:
5

Indel size:     3       Times found:    1
Positions:
12

Indel size:     4       Times found:    1
Positions:
21

------- Script -------
#!usr/bin/perl
# By Michael S. Robeson II, with the help of friends at lernperl.org 
and bioperl.org! :-)
# 10/16/2004

use warnings;
use strict;

###############################
# Open Sequence Data & OUTFILE
###############################

print "Enter in the name of the DNA sequence file:\n";
chomp (my $dna_seq = <STDIN>);

open(DNA_SEQ, $dna_seq)
	or die "Can't open file: $!\n";

open(OUTFILE, ">indel_list_"."$dna_seq")
	or die "Can't open outfile: $!\n";

############################
# Read sequence data into a hash
############################

my %sequences;

$/ = '>';

print "\n***Discovered the following DNA sequences:***\n";

while ( <DNA_SEQ> ) {
	chomp;
	next unless s/^\s*(.+)//;
	my $name = $1;
	s/\s//g;
	$sequences{$name} = $_;
	print "$name found!\n";
}
close DNA_SEQ;

######################################
# iterate over gaps and write to file
######################################

foreach (keys %sequences) {
	print "\t\t\>\>\>\>\>\> $_ \<\<\<\<\<\<\n";
	print OUTFILE "\>\>\>\>\>\> $_ \<\<\<\<\<\<\n";
	my $dna = $sequences{$_};
	my %gap_data;
	my %position;
	while ($dna =~ /(\-+)/g) {
		my $gap_pos = pos ($dna) - length($&) + 1;
		my $gap_length = length $1; #$1 =~ tr/\-+//
		$gap_data{$gap_length}++;
		$position{$gap_length} .= "$gap_pos"." \n";
	}

	my @indels = keys (%gap_data);
	my @keys = sort { $a <=> $b} @indels;

	foreach my $key (@keys) {
		print "Indel size:\t$key\tTimes found:\t$gap_data{$key}\n";
		print OUTFILE "Indel size:\t$key\tTimes found:\t$gap_data{$key}\n";
		print "Positions:\n";
		print OUTFILE "Positions:\n";
		print "$position{$key}";
		print OUTFILE "$position{$key}";
		print "\n";
		print OUTFILE "\n";
		}
	# Can replace the last "foreach loop" above with the while loop
	# below to do the same thing. Only Gap sizes will not be sorted.
	# nor is it set up to print to a file
	#	
	# while (my ($key, $vlaue)  = each (%gap_data)) {
	#	print "Indel size:\t$key\tTimes found:\t$gap_data{$key}\n";
	#	print "Positions:\n";
	#	print "$position{$key}";
	#	print "\n\n";
	# }

}