[Bioperl-l] gap position script
Michael Robeson
popgen23 at mac.com
Fri Nov 12 13:48:47 EST 2004
Hi all,
Below is the full code that will tally up the gap-size and positions
for an aligned data set in FASTA format. More descriptions are in the
script below. As always feel free to send me comments!
-Cheers!
-Mike
--------
http://homepage.mac.com/popgen23
<-- begin code -->
#!usr/bin/perl
#
# By Mike Robeson November 12, 2004. This script could not
# have been done w/o the help of the wonderful folks at
# bioperl.org and learnperl.org.
#
# This script will take an aligned sequence file in FASTA
# format and tally up the sizes and positions of variously
# sized gaps. The output is in the form of a matrix that
# denotes the presence or absence of each gap type by using
# 1s (present) and 0s (absent). File is output as a csv file
# for easy viewing in a spread sheet (e.g. indel_list_$file.csv).
#
# Example output:
#
# Species:,mouse,rat,human
# 1_20,1,0,1
# 2_34,1,0,1
# 3_65,0,1,0
#
# Note: The first row denotes the species for each column of
# data below it. After which, the info before the first ',' is
# the gap-size followed by it's position (i.e. 1_20 means that
# it is a 1 bp gap at the 20th position in the sequence. Followed
# by its presence or absence in each species.
#
# For an easier reading output file just change some of the
# ',' to '\t' in the script.
#
use warnings;
use strict;
my (@animals, $animal, %gap);
##########################
# Request file for input
##########################
print "Enter in the name of the DNA sequence file:\n";
chomp (my $dna_seq = <STDIN>);
open(DNA_SEQ, $dna_seq)
or die "Can't open file: $!\n";
$dna_seq =~ s/\..+//g;
# The outfile name is made by automatically made by
# adding "indel_list_" to the front of
# the name of your input file.
open(OUTFILE, ">indel_list_"."$dna_seq"."\.csv")
or die "Can't open outfile: $!\n";
##########################
# Read Data into Hash
##########################
$/ = '>';
while (<DNA_SEQ>) {
chomp;
next unless s/^\s*(.+)//;
$animal = $1;
push @animals, $animal;
$_ =~ s/\s//g;
while (/(-+)/g) {
my $gap_length = length $1;
my $position = pos() - $gap_length +1;
$gap{$gap_length.'_'. $position}{$animal} = 1;
}
}
######################################
# Print out tab delimited data matrix
######################################
print OUTFILE "Species:";
foreach my $animal (@animals) {
print "Processing: $animal\n";
print OUTFILE ",$animal";
}
print OUTFILE "\n";
foreach my $state (sort keys %gap) {
print OUTFILE "$state:";
foreach my $animal (@animals) {
if (defined $gap{$state}{$animal} ) {
print OUTFILE ",1";
} else {
print OUTFILE ",0";
}
}
print OUTFILE "\n";
}
print "\n***** Finished *****\n\n";
<-- end code -->
More information about the Bioperl-l
mailing list