[Bioperl-l] Counting gaps in sequence data revisited.
Michael S. Robeson II
popgen23 at mac.com
Sun Oct 17 11:39:49 EDT 2004
I just wanted to thank everyone for their help and suggestions. This is
the final full working code to count continuos gaps in a every sequence
in a multi-sequence FASTA file. It may not be elegant but it is fast
and works well. Sorry for the long post but I wanted to share this with
those that do any DNA work. :-)
I show in order: the format of the input data, the output of the input
data, and finally the working script. I have yet to add comments into
the code but I am sure many of you veterans will figure it out.
-Thanks again all! As always comments, questions or suggestions are
welcome!
-Mike
-------- INPUT --------
>dog
atcg--acgat---act-ca----
>cat
acgt-acgt----acgt-gt-agct-
>mouse
---acgtacg-atcg---actgac-
------- OUTPUT ---------
***Discovered the following DNA sequences:***
dog found!
cat found!
mouse found!
>>>>>> mouse <<<<<<
Indel size: 1 Times found: 2
Positions:
11
25
Indel size: 3 Times found: 2
Positions:
1
16
>>>>>> cat <<<<<<
Indel size: 1 Times found: 4
Positions:
5
18
21
26
Indel size: 4 Times found: 1
Positions:
10
>>>>>> dog <<<<<<
Indel size: 1 Times found: 1
Positions:
18
Indel size: 2 Times found: 1
Positions:
5
Indel size: 3 Times found: 1
Positions:
12
Indel size: 4 Times found: 1
Positions:
21
------- Script -------
#!usr/bin/perl
# By Michael S. Robeson II, with the help of friends at lernperl.org
and bioperl.org! :-)
# 10/16/2004
use warnings;
use strict;
###############################
# Open Sequence Data & OUTFILE
###############################
print "Enter in the name of the DNA sequence file:\n";
chomp (my $dna_seq = <STDIN>);
open(DNA_SEQ, $dna_seq)
or die "Can't open file: $!\n";
open(OUTFILE, ">indel_list_"."$dna_seq")
or die "Can't open outfile: $!\n";
############################
# Read sequence data into a hash
############################
my %sequences;
$/ = '>';
print "\n***Discovered the following DNA sequences:***\n";
while ( <DNA_SEQ> ) {
chomp;
next unless s/^\s*(.+)//;
my $name = $1;
s/\s//g;
$sequences{$name} = $_;
print "$name found!\n";
}
close DNA_SEQ;
######################################
# iterate over gaps and write to file
######################################
foreach (keys %sequences) {
print "\t\t\>\>\>\>\>\> $_ \<\<\<\<\<\<\n";
print OUTFILE "\>\>\>\>\>\> $_ \<\<\<\<\<\<\n";
my $dna = $sequences{$_};
my %gap_data;
my %position;
while ($dna =~ /(\-+)/g) {
my $gap_pos = pos ($dna) - length($&) + 1;
my $gap_length = length $1; #$1 =~ tr/\-+//
$gap_data{$gap_length}++;
$position{$gap_length} .= "$gap_pos"." \n";
}
my @indels = keys (%gap_data);
my @keys = sort { $a <=> $b} @indels;
foreach my $key (@keys) {
print "Indel size:\t$key\tTimes found:\t$gap_data{$key}\n";
print OUTFILE "Indel size:\t$key\tTimes found:\t$gap_data{$key}\n";
print "Positions:\n";
print OUTFILE "Positions:\n";
print "$position{$key}";
print OUTFILE "$position{$key}";
print "\n";
print OUTFILE "\n";
}
# Can replace the last "foreach loop" above with the while loop
# below to do the same thing. Only Gap sizes will not be sorted.
# nor is it set up to print to a file
#
# while (my ($key, $vlaue) = each (%gap_data)) {
# print "Indel size:\t$key\tTimes found:\t$gap_data{$key}\n";
# print "Positions:\n";
# print "$position{$key}";
# print "\n\n";
# }
}
More information about the Bioperl-l
mailing list