what about the speed on longer seq? Re: [Bioperl-l] regular

Madeleine Lemieux mlemieux at bioinfo.ca
Sat Jan 22 23:04:11 EST 2005


Below is a test string seeded with 3 instances of inverted repeats
flanked by direct repeats, plus some code to find all such patterns. It's
not as flexible as the EMBOSS palindrome finder, nor is it a one-liner,
but it finds perfect inverted repeats fast.

HTH,
Madeleine

---------------
#!/usr/bin/perl -w

my $test_string =  
"GAAAATGGTTTAATCGGAAATTGAGTAGGAGGATAAAAGTCGCATGCTATTATAAATGAGATGCACTTTC
GACACCTCGCGGAAGTATATAAATGAAAGAAGCCCTCAGAAAACTTTAAATTGGAAATAGAGGGAAAATT
ACTGATGGTTGAAATCAGACCAAAATGGGATTGAAAGAGCCTTTCAGCCCTAGTGTGAGTGTCAGGTTTA
acgtgggtttatctcaaacccacgtCTCTTGTTGAAATCAGACCAAAATGGGATTGAAAGGTTTGTTAAGGGCTTTGATTTGCTCCTCGGTGGCT
CTGGTTGAAATCAGACCAAAATGGGATTGAAAGTAAAGCAGTTCACCCCTGTTACTGGTTTAACTGCCTT
GTTGAAATCAGACCAAAATGGGATTGAAAGGTATTTGAATCAATGAAAAGAAATCTTACCTCGTCGTTGA
AATCAGACCAAAATGGGATTGAAAGAGTCTTCTGGATGGGTCACAAGGGAGACATCGAGGCGTTGAAATC
AGACCAAAATGGGATTGAAAGTCAGCAAGGTTACGTCGGAGATCCTCGAAGAGGGTATCAGTTGAAATCA
GACCAAAATGGGATTGAAAGCGAGGATTGCTGCCAAAGAGAGCGCCTCGTTCTTCGGTTGAAATCAGACC
AAAATGGGATTGAAAGAAAGTGAACATGCTTAAAGAAATGCTGACAGAAATTGAGTTGAAATCAGACCAA
AATGGGATTGAAAGAGCGAGGAAGAGCTTGACGAATTCTTCAAAAGCGGAGTTGAAATCAGACCAAAATG
GGATTGAAAGTTGCATTTACATCGGCAGAATTGGTCTCGTCGGAAGGCATGTTGAAATCAGACCAAAATG
tttaatatcaaAGCATgggaaaggatattCCAAaatatcctttcccGCATacatataccataGGATTGAAAGCGGTTCTCTTACGTACTCATGCGAGAAGTGAGACTCGCGTTGGTTGAAATCAGACCAAAA
TGGGATTGAAAGAGCAAGTCGTGAAACTGAGCAGTCAAAACAGATCGTTAGTTGAAATCAGACCAAAATG
GGATTGAAAGTTTTCCCATACAATTACGACTTCGCCGGAAAAAAAGTTGAAATCAGACCAAAATGGGATT
GAAAGAGCGAGTTCGACCACGTCGTAGGTCTGCTGTCGGCAAGTTGAAATCAGACCAAAATGGGATTGAA
AGTGTTTGAAGTAGTTGAATACACCGTTGTGCTGTTTGTTGTTGAAATCAGACCAAAATGGGATTGAAAG
AGAGGGAGTATTAGGGCCATACTGGCCGGAGTTGTGGTTGTTGAAATCAGACCAAAATGGGATTGAAAGA
TTCCAAATTGCGGAAAAAGATTCGAGGGCAGTTACTTCCCGTTGAAATCAGACCAAAATGGGATTGAAAG
ccttgtgtacacccttACGTCGTTTATTGCCGTAACGCTAACACCATACTCAAGAGTTGAAATCAGACCAAAATGGGATTGAAAGA
AAGCCGTCCAGCGATTGTTTTCATCCGCACCGATAATAGGTTGAAATCAGACCAAAATGGGATTGAAAGG
GTTTAGACTTCCAGCAGGTAAGACATTCAAGGTTCGTTGAAATCAGACCAAAATGGGATTGAAAGGAGGT
AATAGCTGCGAGGGTCAAGCAGGTTTACGAGAAGTTGAAATCAGACCAAAATGGGATTGAAAGGAGCAAT";

# arbitrarily insist on direct and inverted repeats of at least 4 bases long
while ( (length $test_string) > 15 ) {
    my $seq = lc $test_string;
    # find direct repeats and work on the sequence between them;
    # stop once no direct repeat is left
    # (note: [acgtn]+ does not match the newlines embedded in $test_string,
    # so both copies of a direct repeat must lie on the same line)
    last unless $seq =~ m/([acgt]{4,})(?=([acgtn]+)\1)/;
    my $direct = $1;
    my $middle_stuff = my $reverse_complement = $2;
    # record where this direct repeat starts now, before a later match
    # overwrites the match variables
    my $direct_start = $-[1];

    $reverse_complement = reverse $reverse_complement;
    $reverse_complement =~ tr/acgtn/tgcan/;
    my $inverted = "";
    my $char = "";
    # starting from the position next to the direct repeat, build up a string
    # from the matching characters of the original sequence and its rev_compl
    # don't bother looking past mid_point of string
    my $mid_point = (length $middle_stuff) / 2;
    while ( ((length $middle_stuff) > $mid_point) &&
            (($char = chop $middle_stuff) eq (chop $reverse_complement)) ) {
        $inverted = $inverted . $char;
    }
    if ( (length $inverted) > 3 ) {
        if ($inverted =~ m/n/) {
            print "possible inverted repeat found: $inverted\nbetween $direct\n";
        } else {
            print "inverted repeat found: $inverted\nbetween $direct\n";
        }
        print "substring length = ", length $test_string, "\n\n";
#       last;    # uncomment to stop after the first hit
    }
    # step through the original string from the 2nd position of the
    # current direct repeat (searching for the first copy of $direct instead
    # could land on an earlier line and report the same hit again)
    $test_string = substr $test_string, $direct_start + 1;
}
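
To check the two core steps of the script above in isolation, here is a
minimal sketch run on the first seeded pattern from the test string. The
helper names flanked_region and inverted_arm are hypothetical, introduced
only for this illustration; the regex and the chop-based comparison are the
ones used in the script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: find the first direct repeat of 4+ bases and the
# stretch between its two copies. The greedy [acgtn]+ inside the lookahead
# reaches for the farthest later copy of the repeat.
sub flanked_region {
    my ($seq) = @_;
    my $lc_seq = lc $seq;
    return unless $lc_seq =~ m/([acgt]{4,})(?=([acgtn]+)\1)/;
    my ($direct, $middle) = ($1, $2);
    return ($direct, $middle);
}

# Hypothetical helper: compare the right-hand end of $middle with the
# complement of its left-hand end, one base at a time, and return the
# matching bases in the order they were chopped off.
sub inverted_arm {
    my ($middle) = @_;
    my $rev_compl = reverse $middle;
    $rev_compl =~ tr/acgtn/tgcan/;
    my $mid_point = length($middle) / 2;
    my $inverted  = "";
    while ( length($middle) > $mid_point ) {
        my $char = chop($middle);
        last if $char ne chop($rev_compl);
        $inverted .= $char;
    }
    return $inverted;
}

my ($direct, $middle) = flanked_region("acgtgggtttatctcaaacccacgt");
print "direct repeat: $direct\n";                       # acgt
print "between:       $middle\n";                       # gggtttatctcaaaccc
print "inverted arm:  ", inverted_arm($middle), "\n";   # cccaaa
```

Note that the reported arm is the right-hand arm read backwards (here
aaaccc read as cccaaa), which is the same as the complement of the
left-hand arm gggttt.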


