what about the speed on longer seq? Re: [Bioperl-l] regular
Madeleine Lemieux
mlemieux at bioinfo.ca
Sat Jan 22 23:04:11 EST 2005
Below is a test string seeded with 3 instances of inverted repeats
flanked by direct repeats, along with some code to find all such
patterns. It's not as flexible as the EMBOSS palindrome finder, nor is
it a one-liner, but it finds perfect inverted repeats fast.
HTH,
Madeleine
---------------
#!/usr/bin/perl -w
my $test_string =
"GAAAATGGTTTAATCGGAAATTGAGTAGGAGGATAAAAGTCGCATGCTATTATAAATGAGATGCACTTTC
GACACCTCGCGGAAGTATATAAATGAAAGAAGCCCTCAGAAAACTTTAAATTGGAAATAGAGGGAAAATT
ACTGATGGTTGAAATCAGACCAAAATGGGATTGAAAGAGCCTTTCAGCCCTAGTGTGAGTGTCAGGTTTA
acgtgggtttatctcaaacccacgtCTCTTGTTGAAATCAGACCAAAATGGGATTGAAAGGTTTGTTAAGGG
CTTTGATTTGCTCCTCGGTGGCT
CTGGTTGAAATCAGACCAAAATGGGATTGAAAGTAAAGCAGTTCACCCCTGTTACTGGTTTAACTGCCTT
GTTGAAATCAGACCAAAATGGGATTGAAAGGTATTTGAATCAATGAAAAGAAATCTTACCTCGTCGTTGA
AATCAGACCAAAATGGGATTGAAAGAGTCTTCTGGATGGGTCACAAGGGAGACATCGAGGCGTTGAAATC
AGACCAAAATGGGATTGAAAGTCAGCAAGGTTACGTCGGAGATCCTCGAAGAGGGTATCAGTTGAAATCA
GACCAAAATGGGATTGAAAGCGAGGATTGCTGCCAAAGAGAGCGCCTCGTTCTTCGGTTGAAATCAGACC
AAAATGGGATTGAAAGAAAGTGAACATGCTTAAAGAAATGCTGACAGAAATTGAGTTGAAATCAGACCAA
AATGGGATTGAAAGAGCGAGGAAGAGCTTGACGAATTCTTCAAAAGCGGAGTTGAAATCAGACCAAAATG
GGATTGAAAGTTGCATTTACATCGGCAGAATTGGTCTCGTCGGAAGGCATGTTGAAATCAGACCAAAATG
tttaatatcaaAGCATgggaaaggatattCCAAaatatcctttcccGCATacatataccataGGATTGAAAG
CGGTTCTCTTACGTACTCATGCGAGAAGTGAGACTCGCGTTGGTTGAAATCAGACCAAAA
TGGGATTGAAAGAGCAAGTCGTGAAACTGAGCAGTCAAAACAGATCGTTAGTTGAAATCAGACCAAAATG
GGATTGAAAGTTTTCCCATACAATTACGACTTCGCCGGAAAAAAAGTTGAAATCAGACCAAAATGGGATT
GAAAGAGCGAGTTCGACCACGTCGTAGGTCTGCTGTCGGCAAGTTGAAATCAGACCAAAATGGGATTGAA
AGTGTTTGAAGTAGTTGAATACACCGTTGTGCTGTTTGTTGTTGAAATCAGACCAAAATGGGATTGAAAG
AGAGGGAGTATTAGGGCCATACTGGCCGGAGTTGTGGTTGTTGAAATCAGACCAAAATGGGATTGAAAGA
TTCCAAATTGCGGAAAAAGATTCGAGGGCAGTTACTTCCCGTTGAAATCAGACCAAAATGGGATTGAAAG
ccttgtgtacacccttACGTCGTTTATTGCCGTAACGCTAACACCATACTCAAGAGTTGAAATCAGACCAAA
ATGGGATTGAAAGA
AAGCCGTCCAGCGATTGTTTTCATCCGCACCGATAATAGGTTGAAATCAGACCAAAATGGGATTGAAAGG
GTTTAGACTTCCAGCAGGTAAGACATTCAAGGTTCGTTGAAATCAGACCAAAATGGGATTGAAAGGAGGT
AATAGCTGCGAGGGTCAAGCAGGTTTACGAGAAGTTGAAATCAGACCAAAATGGGATTGAAAGGAGCAAT";
# arbitrarily insist on direct and inverted repeats of at least 4 bases long
while ( (length $test_string) > 15 ) {
    $seq = lc $test_string;
    # find direct repeats and work on the sequence between them
    $seq =~ m/([acgt]{4,})(?=([acgtn]+)\1)/;
    my $direct = $1;
    my $middle_stuff = my $reverse_complement = $2;
    if ($direct && $middle_stuff) {
        $reverse_complement = reverse $reverse_complement;
        $reverse_complement =~ tr/acgtn/tgcan/;
        my $inverted = "";
        my $char = "";
        # starting from the position next to the direct repeat, build up a string
        # from the matching characters of the original sequence and its rev_compl
        # don't bother looking past mid_point of string
        my $mid_point = (length $middle_stuff) / 2;
        while ( ((length $middle_stuff) > $mid_point) &&
                (($char = chop $middle_stuff) eq (chop $reverse_complement)) ) {
            $inverted = $inverted . $char;
        }
        if ( (length $inverted) > 3) {
            if ($inverted =~ m/n/) {
                print "possible inverted repeat found: $inverted\nbetween $direct\n";
            } else {
                print "inverted repeat found: $inverted\nbetween $direct\n";
            }
            print "substring length = ", length $test_string, "\n\n";
            # last;
        }
        # step through the original string from the 2nd position of the
        # current direct repeat
        $seq =~ m/$direct/g;
        my $newstart = pos($seq) - (length $direct) + 1;
        $test_string = substr $test_string, $newstart;
    } else {
        last;
    }
}
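
The end-matching step in the script can also be pulled out into a small
helper for trying individual fragments by hand. This is only a minimal
sketch, assuming lowercase acgtn input. The names revcomp and
inverted_arm are hypothetical and are not part of the script above or of
BioPerl. It compares positions with substr instead of chop, so the arm
comes back in reading order rather than reversed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Reverse complement of a lowercase DNA string (n stays n).
# Hypothetical helper, not a BioPerl function.
sub revcomp {
    my ($s) = @_;
    my $rc = reverse $s;
    $rc =~ tr/acgtn/tgcan/;
    return $rc;
}

# Return the longest arm at the start of $middle whose reverse complement
# sits at the very end of $middle. The arm never extends past the midpoint.
sub inverted_arm {
    my ($middle) = @_;
    my $rc  = revcomp($middle);
    my $max = int(length($middle) / 2);
    my $len = 0;
    # Position i of the reverse complement pairs with position
    # length-1-i of the original, so a shared prefix of the two
    # strings is an inverted repeat hugging both ends of $middle.
    $len++ while $len < $max
        && substr($middle, $len, 1) eq substr($rc, $len, 1);
    return substr($middle, 0, $len);
}

# Sequence between the two "acgt" direct repeats of the first seeded instance
print inverted_arm("gggtttatctcaaaccc"), "\n";   # prints "gggttt"
```

A caller would still apply its own minimum length (the script above
insists on more than 3 bases) before reporting the arm as a hit.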