[Bioperl-l] A pattern problem in Perl
Andrew Dalke
dalke at dalkescientific.com
Wed Apr 30 23:49:32 EDT 2003
darson:
> Those lines read as A: "Identities = 124/135 (91%), Gaps = 2/135 (1%)"
> or just B: "Identities = 124/135 (91%)". Both types coexist.
> I wrote the pattern matching script as
> "$judgecont=~m/Identities = (.*)\/.*?\((.*)%\)/;"
> My purpose is to grab $1 (the numerator of the fraction) and $2 (the
> percentage of Identities, not Gaps).
> However, because of greedy matching, I always get the "1" (the
> percentage of Gaps) in situation A.
You are still too greedy. Here's what I get with your pattern
% cat > input.dat
Identities = 124/135 (91%), Gaps = 2/135 (1%)
Identities = 124/135 (91%)
^D
% cat input.dat | perl -ne \
'm/Identities = (.*)\/.*?\((.*)%\)/; print "$1*$2*\n";'
124/135 (91%), Gaps = 2*1*
124*91*
%
Here's what's happening with the first line.
(.*) --- matches the "124/135 (91%), Gaps = 2" (the literal
"Identities = " has already been matched before it)
\/ --- matches the "/" in "2/135"
.*?\( --- matches the "135 ("
(.*) --- matches to the end, then backtracks so the %\) in the
next step can match. This matches the "1" in "1%"
%\) --- matches the "%)" at the very end of the line.
The failure occurs because the first (.*) matches all the way
to the end of the line, then backtracks only until the rest of
the pattern can match. That happens at the last "/", well before
it has backtracked as far as you want it to go.
You want it to stop grabbing characters when it first reaches
a "/", which means you should make it non-greedy, as in
% cat input.dat | perl -ne \
'm/Identities = (.*?)\/.*?\((.*?)%\)/; print "$1*$2*\n";'
124*91*
124*91*
%
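The same comparison can be written as a small script rather than a
one-liner. This is only a sketch: the sub names greedy and lazy are made
up for illustration, and the two sample lines are the ones from input.dat
above.

```perl
use strict;
use warnings;

# Sample lines, same as input.dat above.
my @lines = (
    'Identities = 124/135 (91%), Gaps = 2/135 (1%)',
    'Identities = 124/135 (91%)',
);

# Hypothetical helper: the original greedy pattern.
# In list context the match returns the captures ($1, $2).
sub greedy {
    my ($line) = @_;
    return $line =~ m/Identities = (.*)\/.*?\((.*)%\)/;
}

# Hypothetical helper: the same pattern with non-greedy captures.
sub lazy {
    my ($line) = @_;
    return $line =~ m/Identities = (.*?)\/.*?\((.*?)%\)/;
}

for my $line (@lines) {
    my @g = greedy($line);
    my @l = lazy($line);
    print "greedy: $g[0]*$g[1]*\n";
    print "lazy:   $l[0]*$l[1]*\n";
}
```

Only the first line shows a difference: the greedy version prints
"124/135 (91%), Gaps = 2*1*" for it, while the non-greedy version prints
"124*91*" for both lines.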
At least, that's my understanding of what you want.
Also, I've found it better to be explicit about what you want,
rather than depend on greediness, and don't put
too much faith in .* -- it'll bite back in strange ways.
Here's what I would have used
/Identities = (\d+)\/\d+ \((\d+)%\)/
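As a rough sketch of how a digits-only pattern like this might sit in a
script: the sub name parse_identities and the third, non-matching sample
line are invented for illustration. It uses m{...} delimiters so the "/"
between the two counts needs no escaping, and it puts the "+" inside the
second capture so the whole percentage is kept.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original script): return the
# numerator and the Identities percentage from one line, or an
# empty list if the line does not match.
sub parse_identities {
    my ($line) = @_;
    # \d+ can only match digits, so neither capture can run on
    # into the "Gaps" part of the line.
    if ($line =~ m{Identities = (\d+)/\d+ \((\d+)%\)}) {
        return ($1, $2);
    }
    return;
}

for my $line (
    'Identities = 124/135 (91%), Gaps = 2/135 (1%)',
    'Identities = 124/135 (91%)',
    'Score = 200 bits',    # made-up line that should not match
) {
    my ($num, $pct) = parse_identities($line);
    if (defined $num) {
        print "$num*$pct*\n";
    }
    else {
        print "no match\n";
    }
}
```

Checking the return value before using $1 and $2 matters here, since
after a failed match they still hold whatever the last successful match
left in them.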
Andrew
dalke at dalkescientific.com