[Bioperl-l] gap position script

Fri Nov 12 13:48:47 EST 2004

Hi all,

Below is the full code that will tally up the gap-size and positions 
for an aligned data set in FASTA format. More descriptions are in the 
script below. As always feel free to send me comments!

-Cheers!
-Mike
--------
http://homepage.mac.com/popgen23

<-- begin code -->

#!usr/bin/perl
#
# By Mike Robeson November 12, 2004. This script could not
# have been done w/o the help of the wonderful folks at
# bioperl.org and learnperl.org.
#
# This script will take an aligned sequence file in FASTA
# format and tally up the sizes and positions of variously
# sized gaps. The output is in the form of a matrix that
# denotes the presence or absence of each gap type by using
# 1s (present) and 0s (absent). File is output as a csv file
# for easy viewing in a spread sheet (e.g. indel_list_$file.csv).
#
# Example output:
#
# Species:,mouse,rat,human
# 1_20,1,0,1
# 2_34,1,0,1
# 3_65,0,1,0
#
# Note: The first row denotes the species for each column of
# data below it. After which, the info before the first ',' is
# the gap-size followed by it's position (i.e. 1_20 means that
# it is a 1 bp gap at the 20th position in the sequence. Followed
# by its presence or absence in each species.
#
# For an easier reading output file just change some of the
# ',' to '\t' in the script.
#

use warnings;
use strict;

my (@animals, $animal, %gap);

##########################
# Request file for input
##########################

print "Enter in the name of the DNA sequence file:\n";
chomp (my $dna_seq = <STDIN>);

open(DNA_SEQ, $dna_seq)
	or die "Can't open file: $!\n";

$dna_seq =~ s/\..+//g;

# The outfile name is made by automatically made by
# adding "indel_list_" to the front of
# the name of your input file.

open(OUTFILE, ">indel_list_"."$dna_seq"."\.csv")
	or die "Can't open outfile: $!\n";

##########################
# Read Data into Hash
##########################

$/ = '>';

while (<DNA_SEQ>) {
      chomp;
      next unless s/^\s*(.+)//;
      $animal = $1;
      push @animals, $animal;
      $_ =~ s/\s//g;
      while (/(-+)/g) {
           my $gap_length = length $1;
           my $position = pos() - $gap_length +1;
           $gap{$gap_length.'_'. $position}{$animal} = 1;
      }
}

######################################
# Print out tab delimited data matrix
######################################

print OUTFILE "Species:";

foreach my $animal (@animals) {
	print "Processing: $animal\n";
	print OUTFILE ",$animal";
}

print OUTFILE "\n";

foreach my $state (sort keys %gap) {
	print OUTFILE "$state:";
     foreach  my $animal (@animals) {
           if (defined $gap{$state}{$animal} ) {
                print OUTFILE ",1";
           } else {
                print OUTFILE ",0";
           }
      }
print OUTFILE "\n";
}

print "\n***** Finished *****\n\n";

<-- end code -->