[Bioperl-l] seq_word and pattern counts

Staffa, Nick (NIH/NIEHS) [C] staffa at niehs.nih.gov
Thu Mar 9 16:17:32 UTC 2006


I did make a perl based script that works.
And I thank all who tried to help -- it did set me on a path to success.
Whereas I have a first-grade education in Perl
and, to me, object oriented programming concepts, 
the jargon et c. is like Greek. (I could go on about the complete nerdiness of it all)
It worked out better to create my own subroutines without using the shortcuts and tricks that make most perl scripts unintelligable, IMHO.
However, looking at the code in Bio::Tools::SeqWords gave me a hint on how to use pattern matching with the g qualifyer in a loop. (tho I thought it weird that to count a certain pattern, they make a hash of every possible n-mer and count them all)
The module on restriction enzymes gave me code to change ambiguity symbols into regular expression- something I could figure out, no doubt. 

And here it is. (I really should have kept the use of . for ANY character)

#!/usr/bin/perl

use strict;
use Bio::DB::Fasta;
use Bio::Tools::SeqWords;
use Bio::Seq;
my $pattern = @ARGV[0];
open (LIST,"<names_and_numbers.txt");
#  The GRAND LOOP thru all the chromosomes
my $filenameline;
my $filename;
my $sequenceID;
while ($filenameline = <LIST>) {
print $filenameline;
chomp $filenameline;
($filename, $sequenceID) = split(/\s+/, $filenameline);
print "filename=$filename  sequenceID=$sequenceID\n";
my $db = Bio::DB::Fasta->new("/home/staffa/clients/colaneria/Mouse_chromosomes/$filename",
                                -makeid => \&make_my_id);
my $obj = $db->get_Seq_by_id("$sequenceID");
my $start = 0;
my $windowsize = 10003;
my $enzstr = $pattern;
my $CG = "CG";
my $len = $obj->length;
my $overlap = 10000;
print "start=$start window=$windowsize string=$enzstr overlap=$overlap length=$len\n";
my @filenameparts = split(/_/,$filename);
my ($chrNo,$fa) = split(/\./,$filenameparts[2]);
my $outputfilename = "${chrNo}_$pattern.count";
print "out=$outputfilename\n";
open (OUTPUTFILE,">${outputfilename}.out");
while (1) {
  my $end = $start + $windowsize;
  last if($end > $len);
  my $subseq = $obj->subseq($start,$end);
  my $enzcount = &get_count($enzstr,$subseq);
  my $CGcount = &get_count($CG,$subseq);
  my $enzCGratio = 0;
  if ($CGcount > 0){$enzCGratio = $enzcount/$CGcount;}
  printf OUTPUTFILE "%d %d %d %d %g\n", $start,$end,$CGcount,$enzcount,$enzCGratio;
  $start += $overlap;
      }
      close OUTPUTFILE;
} 
sub get_count {
   my ($str,$subseq) = @_;
#   print "look for $str \n";
   my $pat = &expanded_string($str);
#   print "search pattern= $pat\n";
   my $count=0;

  while($subseq =~ /($pat)/gim){ 
                $count++; 
      }
#     print "count=$count\n";
      return $count;
       }

sub make_my_id {
    my $line = shift;
    $line =~ /gi\|(\d+)/;
    $1;
      }
      
   sub expanded_string  {
    my $str = @_[0];
        $str =~ s/N/\[ACGT\]/g;
        $str =~ s/R/\[AG\]/g;
        $str =~ s/Y/\[CT\]/g;
        $str =~ s/S/\[GC\]/g;
        $str =~ s/W/\[AT\]/g;
        $str =~ s/M/\[AC\]/g;
        $str =~ s/K/\[TG\]/g;
        $str =~ s/B/\[CGT\]/g;
        $str =~ s/D/\[AGT\]/g;
        $str =~ s/H/\[ACT\]/g;
        $str =~ s/V/\[ACG\]/g;
    return $str;
}




Nick Staffa
Telephone: 919-316-4569  (NIEHS: 6-4569)
Scientific Computing Support Group
NIEHS Information Technology Support Services Contract
(Science Task Monitor: Jack L. Field (field1 at niehs.nih.gov ))
National Institute of Environmental Health Sciences
National Institutes of Health
Research Triangle Park, North Carolina




-----Original Message-----
From: Torsten Seemann [mailto:torsten.seemann at infotech.monash.edu.au]
Sent: Tuesday, February 28, 2006 5:47 PM
To: Staffa, Nick (NIH/NIEHS) [C]
Cc: bioperl-l
Subject: Re: [Bioperl-l] seq_word and pattern counts


Staffa, Nick (NIH/NIEHS) [C] wrote:
> The real problem is this:
> We want to count sites in a long sequence where a restriction enzyme would cut.
> This restriction enzyme, in the example I gave will recognize GGnnCC,
> that is two G separated by two of any bases followed by two C.
> The GCG program findpatterns will do this, but bioperl makes certain statistics easy.
> I'm sure there is some module somewhere for this purpose. 

(Nick - please respond to me AND the bioperl-l at bioperl.org mailing list 
ie. "Reply All", so others can benefit from the Q&A - I've re-sent your 
past responses already).

Perhaps this module?

http://doc.bioperl.org/bioperl-live/Bio/Tools/RestrictionEnzyme.html

With this code?

my $enz = "GGNNCC";
my $re = new Bio::Tools::RestrictionEnzyme(-NAME =>"NicksResEnz--$enz",
	  			  	 -MAKE =>'custom');
@fragments = $re->cut_seq($seqobj);
print "$enz cuts ", $seqobj->display_id, " ", scalar(@fragments), " 
times.\n";

-- 
Torsten Seemann
Victorian Bioinformatics Consortium, Monash University, Australia
http://www.vicbioinformatics.com/
Phone: +61 3 9905 9010




More information about the Bioperl-l mailing list