[Bioperl-l] regular expression help!

James D. White jdw at ou.edu
Fri Jan 21 11:54:37 EST 2005


Sorry about double posting, but I forgot to change the subject before
sending the first message.

> Starting with:
>
> $regex =~ /\S+(\S+)(\S{10}).*(??{$rev=reverse(\2 =~ tr/ATCG/TAGC/i);})\1.*/i;
>
> The slashes in tr/// confused the Perl parser.  You need to use
> different delimiters for the m// operator (the m is implied by //)
> and the tr/// operator.  Also the tr/// operator does not use the
> i flag, so lower case needs to be handled explicitly.  So let's
> try the following:
>
> $regex =~ m:\S+(\S+)(\S{10}).*(??{$rev=reverse(\2 =~ tr/ATCGatcg/TAGCtagc/);})\1.*:i;
>
> This gives the error:
> Can't modify constant item in transliteration (tr///) at (re_eval 1)
> line 1, near "tr/ATCGatcg/TAGCtagc/)"
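The tr/// points above can be checked outside the regex. This is only a small sketch; the variable names are made up for illustration and do not come from the original script. It shows that tr/// has no case-insensitive mode, that it needs a modifiable variable on its left-hand side, and that its return value is a count rather than the changed string.

```perl
use strict;
use warnings;

my $seq = 'ATCGatcg';

# Copy first, then transliterate the copy: tr/// changes its target in place.
(my $upper_only = $seq) =~ tr/ATCG/TAGC/;          # lower case is left alone
(my $both_cases = $seq) =~ tr/ATCGatcg/TAGCtagc/;  # both cases listed explicitly

# The value of the tr/// expression is the number of characters matched.
# An empty replacement list leaves $seq unchanged and just counts.
my $count = ($seq =~ tr/ATCGatcg//);

print "$upper_only\n";   # TAGCatcg
print "$both_cases\n";   # TAGCtagc
print "$count\n";        # 8
```

Because the expression evaluates to a count, wrapping it in reverse() would reverse that number, not the sequence.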
>
> Inside the (??{ CODE }) sequence, use $1, $2, ..., instead of
> \1, \2, ... (see Programming Perl, 3rd Edition, "Match-time pattern
> interpolation", p. 213).  Inside the evaluated CODE, \2 is a
> reference to the constant 2, not the value of the second captured
> substring.  Also, tr/// changes its target in place and returns a
> count rather than the changed string, and $2 is read-only, so copy
> $2 into a variable first.  So let's try:
>
> $regex =~ m:\S+(\S+)(\S{10}).*(??{$rev = $2; $rev =~ tr/ATCGatcg/TAGCtagc/; reverse($rev);})\1.*:i;
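To try this form on its own, here is a minimal sketch under a few assumptions: the test sequence is made up, $rev is declared with "our" so the embedded code compiles under "use strict", and "scalar" is added in front of reverse to make the string-reversal intent explicit. It is not the original poster's script.

```perl
use strict;
use warnings;

our $rev;   # package variable used by the code embedded in the pattern

# Made-up data: one leading base, a direct repeat, a 10-base arm, a spacer,
# the reverse complement of the arm, and the direct repeat again.
my $seq = 'A' . 'CTTA' . 'GGGCCATACG' . 'AAAA' . 'CGTATGGCCC' . 'CTTA';

my ($direct, $arm) = ('', '');
if ($seq =~ m:\S+(\S+)(\S{10}).*(??{ $rev = $2; $rev =~ tr/ATCGatcg/TAGCtagc/; scalar reverse $rev })\1.*:i) {
    ($direct, $arm) = ($1, $2);   # save the captures before any other match
    print "direct repeat: $direct\n";
    print "inverted repeat arm: $arm\n";
}
```

With this input the captures should be "CTTA" for the direct repeat and "GGGCCATACG" for the arm.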
>
> This works, but I would get rid of the leading "\S+" and the
> trailing ".*".  The trailing ".*" adds nothing useful, so just drop
> it.  You probably don't need the leading "\S+" either, because the
> pattern is not anchored to the beginning of the string with "^".
> The leading "\S+" gobbles up as much of the string as it can,
> forcing the match to backtrack character by character from the end.
> It also forces the substring saved in $1 to start after the first
> character.  Unless you really want $1 never to include the first
> character, just drop the leading "\S+".  If you do want to skip the
> first character, use a single "\S" instead.  This results in:
>
> $regex =~ m:(\S+)(\S{10}).*(??{$rev = $2; $rev =~ tr/ATCGatcg/TAGCtagc/; reverse($rev);})\1:i;
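The effect of the leading "\S+" on $1 can be seen with a sequence whose direct repeat starts at the very first base. This is a sketch on made-up data; revcomp() is a hypothetical helper introduced here for readability and is not part of the original message.

```perl
use strict;
use warnings;

# Hypothetical helper: reverse complement of a DNA string, case preserved.
sub revcomp {
    my ($dna) = @_;
    my $rc = reverse $dna;            # scalar assignment, so the string is reversed
    $rc =~ tr/ATCGatcg/TAGCtagc/;
    return $rc;
}

# Made-up sequence: the direct repeat "CTTA" begins at offset 0.
my $seq = 'CTTA' . 'GGGCCATACG' . 'AAAA' . 'CGTATGGCCC' . 'CTTA';

# With the leading \S+, $1 can never start at offset 0.
my $with_prefix = 'no match';
if ($seq =~ m{\S+(\S+)(\S{10}).*(??{ revcomp($2) })\1}i) {
    $with_prefix = sprintf('$1 = %s at offset %d', $1, $-[1]);
}

# Without it, $1 is free to start at the first base.
my $without_prefix = 'no match';
if ($seq =~ m{(\S+)(\S{10}).*(??{ revcomp($2) })\1}i) {
    $without_prefix = sprintf('$1 = %s at offset %d', $1, $-[1]);
}

print "with leading \\S+:    $with_prefix\n";     # no match
print "without leading \\S+: $without_prefix\n";  # $1 = CTTA at offset 0
```

On this input the pattern with the leading "\S+" finds nothing, because the only repeat it could report would need $1 to begin at offset 0.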
>
> Finally I would probably change the remaining ".*" to ".*?".  If
> you search with ".*" on a long sequence which could contain
> multiple sequences of interest, the ".*" pattern will match the rest
> of the sequence and force backtracking to match the first occurrence
> of "$1$2" with the last occurrence of "revcomp($2)$1".  If you use
> ".*?", you match the first occurrence of "$1$2" with the nearest
> occurrence of "revcomp($2)$1".  This results in the final regular
> expression:
>
> $regex =~ m:(\S+)(\S{10}).*?(??{$rev = $2; $rev =~ tr/ATCGatcg/TAGCtagc/; reverse($rev);})\1:i;
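The difference between ".*" and ".*?" shows up when the right-hand half, "revcomp($2)$1", occurs more than once. The sketch below uses a made-up sequence with one left half and two right halves, and compares where each match ends; revcomp() is again a hypothetical helper, not something from the original message.

```perl
use strict;
use warnings;

# Hypothetical helper: reverse complement of a DNA string, case preserved.
sub revcomp {
    my ($dna) = @_;
    my $rc = reverse $dna;            # scalar assignment, so the string is reversed
    $rc =~ tr/ATCGatcg/TAGCtagc/;
    return $rc;
}

# Made-up data: "$1$2", a spacer, then "revcomp($2)$1" twice.
my $seq = 'CTTA' . 'GGGCCATACG' . 'AAAA' . 'CGTATGGCCC' . 'CTTA'
        . 'AAAA' . 'CGTATGGCCC' . 'CTTA';

my ($greedy_end, $lazy_end) = (-1, -1);

# $+[0] is the offset just past the end of the whole match.
$greedy_end = $+[0] if $seq =~ m{(\S+)(\S{10}).*(??{ revcomp($2) })\1}i;
$lazy_end   = $+[0] if $seq =~ m{(\S+)(\S{10}).*?(??{ revcomp($2) })\1}i;

print "greedy .* match ends at offset $greedy_end\n";   # 50 (last right half)
print "lazy .*? match ends at offset $lazy_end\n";      # 32 (nearest right half)
```

Both patterns start at the same place; only the choice of right-hand partner changes.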
>
> > Date: Fri, 14 Jan 2005 14:12:46 -0500
> > From: Guojun Yang <gyang at plantbio.uga.edu>
> > Subject: [Bioperl-l] regular expression help!
> > To: bioperl-l at portal.open-bio.org
> > Message-ID: <20050114141246.94c7cb46 at dogwood.plantbio.uga.edu>
> > Content-Type: text/plain;       charset="us-ascii"
> >
> > Hi, Everybody,
> > I was trying to use a regex to recognize a pattern of inverted repeat DNA sequence flanked by direct repeats (see below), but it returns an error saying "(?{...}) not terminated or {...} not balanced". Can anybody help me sort this out?
> > The regex I have is:
> > $regex =~ /\S+(\S+)(\S{10}).*(??{$rev=reverse(\2 =~ tr/ATCG/TAGC/i);})\1.*/i;
> > Thank you,
> > Yang
> >
>
> --
> James D. White   (jdw at ou.edu)
> Director of Bioinformatics
> Department of Chemistry and Biochemistry/ACGT
> University of Oklahoma
> 101 David L. Boren Blvd., SRTC 2100
> Norman, OK 73019
> Phone: (405) 325-4912, FAX: (405) 325-7762
