[Bioperl-l] Re: Bioperl-l Digest, Vol 21, Issue 12
James D. White
jdw at ou.edu
Fri Jan 21 11:47:22 EST 2005
Starting with:
$regex =~ /\S+(\S+)(\S{10}).*(??{$rev=reverse(\2 =~ tr/ATCG/TAGC/i);})\1.*/i;
The slashes in tr/// confuse the Perl parser: the first "/" of tr///
is taken as the end of the pattern, which leaves the (??{ ... })
unterminated. You need to use different delimiters for the m//
operator (the m is implied by //) and the tr/// operator. Also, the
tr/// operator has no i flag, so lower case needs to be handled
explicitly. So let's try the following:
$regex =~ m:\S+(\S+)(\S{10}).*(??{$rev=reverse(\2 =~ tr/ATCGatcg/TAGCtagc/);})\1.*:i;
This gives the error:
Can't modify constant item in transliteration (tr///) at (re_eval 1)
line 1, near "tr/ATCGatcg/TAGCtagc/)"
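Inside the (??{ CODE }) sequence, use $1, $2, ..., instead of
\1, \2, ... (see Programming Perl, 3rd Edition, "Match-time pattern
interpolation", p. 213). Inside the evaluated CODE, \2 is ordinary
Perl, a reference to the constant 2, not the value of the second
captured substring. Note also that tr/// returns the number of
characters it translated, not the translated string, so reversing
its return value would not give the reverse complement. I'm not
sure what modifying $2 would do, so let's work on a copy: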
$regex =~ m:\S+(\S+)(\S{10}).*(??{$rev = $2; $rev =~ tr/ATCGatcg/TAGCtagc/; reverse($rev);})\1.*:i;
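The reason for copying $2 into $rev before translating can be checked
outside the regex. The following is only a minimal sketch: the helper
name revcomp and the sample string are made up for illustration and
are not part of the original question.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): reverse complement
# of a DNA string. Both cases are listed because tr/// has no i flag.
sub revcomp {
    my ($seq) = @_;
    (my $rc = $seq) =~ tr/ATCGatcg/TAGCtagc/;   # complement a copy
    return scalar reverse $rc;                   # scalar context reverses the string
}

# tr/// edits its target in place and returns a count, not a string.
my $copy  = 'AAGCTTGGCC';
my $count = ($copy =~ tr/ATCGatcg/TAGCtagc/);

print "count:      $count\n";   # 10
print "complement: $copy\n";    # TTCGAACCGG
print "revcomp:    ", revcomp('AAGCTTGGCC'), "\n";   # GGCCAAGCTT
```

Because the count is what tr/// returns, the translation and the
reverse have to be two separate steps, as in the pattern above.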
This works, but I would get rid of the leading "\S+" and the trailing
".*". The trailing ".*" adds nothing useful, so just drop it. You
probably don't need the leading "\S+" either, because the pattern is
not anchored to the beginning of the string with "^". The leading
"\S+" gobbles up the entire run of non-whitespace characters (the
whole sequence, if it contains no whitespace), forcing the match to
backtrack character by character from the end. It also forces the
substring saved in $1 to start after the first character. Unless you
specifically want to exclude the first character from $1, just drop
the leading "\S+". If you do want to exclude it, a single "\S" is
enough. This results in:
$regex =~ m:(\S+)(\S{10}).*(??{$rev = $2; $rev =~ tr/ATCGatcg/TAGCtagc/; reverse($rev);})\1:i;
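The effect of the leading "\S+" on $1 is easiest to see on a
stripped-down pattern. This is just an illustrative sketch with a
made-up four-base input, not the full expression:

```perl
use strict;
use warnings;

my $seq = 'ACGT';   # made-up input

# The leading \S+ first grabs all of "ACGT", then gives back one
# character at a time until (\S+) has something left to capture.
my ($with_prefix) = $seq =~ /\S+(\S+)/;

# Without it, the capture is free to start at the first character.
my ($without_prefix) = $seq =~ /(\S+)/;

print "with leading \\S+: $with_prefix\n";      # T
print "without:          $without_prefix\n";   # ACGT
```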
Finally I would probably change the remaining ".*" to ".*?". If
you search with ".*" on a long sequence which could contain
multiple sequences of interest, the ".*" pattern will match the rest
of the sequence and force backtracking to match the first occurrence
of "$1$2" with the last occurrence of "revcomp($2)$1". If you use
".*?", you match the first occurrence of "$1$2" with the nearest
occurrence of "revcomp($2)$1". This results in the final regular
expression:
$regex =~ m:(\S+)(\S{10}).*?(??{$rev = $2; $rev =~ tr/ATCGatcg/TAGCtagc/; reverse($rev);})\1:i;
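To see the difference between ".*" and ".*?", here is a small sketch
on a made-up sequence in which the closing "revcomp($2)$1" unit occurs
twice. The sequence and the variable names are assumptions for
illustration only. The code block declares $rev with "my" and calls
"scalar reverse" so that it runs under "use strict" and does not
depend on the context the block is evaluated in.

```perl
use strict;
use warnings;

# Toy sequence (invented): direct repeat GATTACA, inverted repeat
# AAGCTTGGCC ... GGCCAAGCTT, with the closing unit present twice.
my $seq = join '',
    'GATTACA', 'AAGCTTGGCC', 'TTTT',
    'GGCCAAGCTT', 'GATTACA', 'CCCC',
    'GGCCAAGCTT', 'GATTACA';

my ($greedy_len, $lazy_len, $repeat);

# Greedy ".*": pairs "$1$2" with the LAST "revcomp($2)$1".
if ($seq =~ m:(\S+)(\S{10}).*(??{ my $rev = $2; $rev =~ tr/ATCGatcg/TAGCtagc/; scalar reverse $rev })\1:i) {
    $greedy_len = $+[0] - $-[0];
    print "greedy: \$1=$1 \$2=$2 length=$greedy_len\n";   # length=59
}

# Lazy ".*?": pairs "$1$2" with the NEAREST "revcomp($2)$1".
if ($seq =~ m:(\S+)(\S{10}).*?(??{ my $rev = $2; $rev =~ tr/ATCGatcg/TAGCtagc/; scalar reverse $rev })\1:i) {
    ($repeat, $lazy_len) = ($1, $+[0] - $-[0]);
    print "lazy:   \$1=$1 \$2=$2 length=$lazy_len\n";     # length=38
}
```

Both matches start at the same place and capture the same $1 and $2;
only the choice of closing unit differs.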
> Date: Fri, 14 Jan 2005 14:12:46 -0500
> From: Guojun Yang <gyang at plantbio.uga.edu>
> Subject: [Bioperl-l] regular expression help!
> To: bioperl-l at portal.open-bio.org
> Message-ID: <20050114141246.94c7cb46 at dogwood.plantbio.uga.edu>
> Content-Type: text/plain; charset="us-ascii"
>
> Hi, Everybody,
> I was trying to use a regex to recognize a pattern of inverted-repeat DNA sequence flanked by direct repeats (see below), but it returns an error saying "(?{...}) not terminated or {...} not balanced". Can anybody help me sort this out?
> The regex I have is:
> $regex =~ /\S+(\S+)(\S{10}).*(??{$rev=reverse(\2 =~ tr/ATCG/TAGC/i);})\1.*/i;
> Thank you,
> Yang
>
--
James D. White (jdw at ou.edu)
Director of Bioinformatics
Department of Chemistry and Biochemistry/ACGT
University of Oklahoma
101 David L. Boren Blvd., SRTC 2100
Norman, OK 73019
Phone: (405) 325-4912, FAX: (405) 325-7762