[Bioperl-l] Re: Bioperl-l Digest, Vol 21, Issue 12

Fri Jan 21 11:47:22 EST 2005

Starting with:

$regex =~ /\S+(\S+)(\S{10}).*(??{$rev=reverse(\2 =~ tr/ATCG/TAGC/i);})\1.*/i;

The slashes in tr/// confused the Perl parser: the first "/" of
tr/// ends the m// pattern early, which leaves the (??{...}) block
unterminated and produces the error you saw.  You need to use
different delimiters for the m// operator (the m is implied by //)
and the tr/// operator.  Also, the tr/// operator has no i flag, so
lower case needs to be handled explicitly.  So let's try the
following:

$regex =~ m:\S+(\S+)(\S{10}).*(??{$rev=reverse(\2 =~ tr/ATCGatcg/TAGCtagc/);})\1.*:i;

This gives the error:
Can't modify constant item in transliteration (tr///) at (re_eval 1)
line 1, near "tr/ATCGatcg/TAGCtagc/)"

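Inside the (??{ CODE }) sequence, use $1, $2, ..., instead of
\1, \2, ... (see Programming Perl, 3rd Edition, "Match-time pattern
interpolation", p. 213).  Inside the evaluated CODE, \2 is a
constant, not the value of the second captured substring.  Also, $2
is read-only, so copy it into another variable before applying
tr///.  And since tr/// returns a count of the characters it
translated rather than the translated string, do the translation as
its own statement before calling reverse.  So let's try: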

$regex =~ m:\S+(\S+)(\S{10}).*(??{$rev = $2; $rev =~ tr/ATCGatcg/TAGCtagc/; reverse($rev);})\1.*:i;

This works, but I would get rid of the leading "\S+" and trailing
".*".  The trailing ".*" adds nothing useful, so just drop it.  You
probably don't need the leading "\S+" either, because the pattern is
not anchored to the beginning of the string with "^".  The leading
"\S+" gobbles up the entire run of non-whitespace characters, forcing
the match to backtrack character by character from the end.  It also
forces the substring saved in $1 to start after the first character.
Unless you specifically want to keep the first character out of $1,
just drop the leading "\S+".  If you do want to skip the first
character, a single "\S" is enough.  This results in:

$regex =~ m:(\S+)(\S{10}).*(??{$rev = $2; $rev =~ tr/ATCGatcg/TAGCtagc/; reverse($rev);})\1:i;
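
The code inside (??{ ... }) can also be tried on its own, outside
the pattern.  The following is only a minimal sketch: the helper name
revcomp and the sample strings are made up for illustration and are
not part of the pattern above.  It shows that tr/// changes its
target in place and returns a count, and it uses an explicit
"scalar" so that reverse reverses the string rather than a list.

```perl
use strict;
use warnings;

# Hypothetical helper (assumed name): reverse-complement a DNA string.
sub revcomp {
    my ($seq) = @_;                  # work on a copy, as "$rev = $2" does
    $seq =~ tr/ATCGatcg/TAGCtagc/;   # no i flag on tr///, so list both cases
    return scalar reverse $seq;      # scalar context: reverse the characters
}

# tr/// modifies the variable in place and returns the number of
# characters it translated, not the translated string.
my $copy  = 'AAAAAGGGGG';
my $count = ($copy =~ tr/ATCGatcg/TAGCtagc/);
print "tr returned $count, string is now $copy\n";

print revcomp('AAAAAGGGGG'), "\n";   # prints CCCCCTTTTT
print revcomp('acgtN'), "\n";        # prints Nacgt (N is left untouched)
```

With a helper like this, the code block in the pattern could be
shortened to (??{ revcomp($2) }).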

Finally, I would probably change the remaining ".*" to ".*?".  If
you search with ".*" on a long sequence which could contain
multiple sequences of interest, the ".*" pattern will match the rest
of the sequence and force backtracking to match the first occurrence
of "$1$2" with the last occurrence of "revcomp($2)$1".  If you use
".*?", you match the first occurrence of "$1$2" with the nearest
occurrence of "revcomp($2)$1".  This results in the final regular
expression:

$regex =~ m:(\S+)(\S{10}).*?(??{$rev = $2; $rev =~ tr/ATCGatcg/TAGCtagc/; reverse($rev);})\1:i;
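
The difference between ".*" and ".*?" can be seen on a small test
case.  This is only a sketch under stated assumptions: the sequence
is made up (direct repeat CAT, inverted repeat AAAAAGGGGG, and two
possible right-hand ends), and revcomp is a hypothetical helper that
stands in for the code block used above.

```perl
use strict;
use warnings;

# Hypothetical helper (assumed name): reverse-complement a DNA string.
sub revcomp {
    my ($seq) = @_;
    $seq =~ tr/ATCGatcg/TAGCtagc/;
    return scalar reverse $seq;
}

# Made-up sequence: CAT + inverted repeat, then two candidate
# right-hand ends, each being revcomp(inverted repeat) + CAT.
my $seq = join '',
    'CAT', 'AAAAAGGGGG', 'TATA',
    'CCCCCTTTTT', 'CAT', 'GAGA',
    'CCCCCTTTTT', 'CAT';

my ($greedy_len, $lazy_len);

# Greedy ".*": pairs the left end with the LAST right-hand end.
if ($seq =~ m:(\S+)(\S{10}).*(??{ revcomp($2) })\1:i) {
    $greedy_len = length $&;
}

# Non-greedy ".*?": pairs the left end with the NEAREST right-hand end.
if ($seq =~ m:(\S+)(\S{10}).*?(??{ revcomp($2) })\1:i) {
    $lazy_len = length $&;
}

print "greedy match length: $greedy_len\n";
print "lazy match length:   $lazy_len\n";
```

On this sequence the greedy form spans all 47 characters, while the
non-greedy form stops after the first right-hand end, at 30
characters.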

> Date: Fri, 14 Jan 2005 14:12:46 -0500
> From: Guojun Yang <gyang at plantbio.uga.edu>
> Subject: [Bioperl-l] regular expression help!
> To: bioperl-l at portal.open-bio.org
> Message-ID: <20050114141246.94c7cb46 at dogwood.plantbio.uga.edu>
> Content-Type: text/plain;       charset="us-ascii"
>
> Hi, Everybody,
> I was trying to use a regex recognizing a pattern of inverted repeat DNA seq flanked by direct repeats (see below), but it returns errors saying "(?{...}) not terminated or {...} not balanced". Can anybody help me sort this out?
> The regex I have is:
> $regex =~ /\S+(\S+)(\S{10}).*(??{$rev=reverse(\2 =~ tr/ATCG/TAGC/i);})\1.*/i;
> Thank you,
> Yang
>

--
James D. White   (jdw at ou.edu)
Director of Bioinformatics
Department of Chemistry and Biochemistry/ACGT
University of Oklahoma
101 David L. Boren Blvd., SRTC 2100
Norman, OK 73019
Phone: (405) 325-4912, FAX: (405) 325-7762