[Bioperl-l] A pattern problem in Perl

Andrew Dalke dalke at dalkescientific.com
Wed Apr 30 23:49:32 EDT 2003


darson:
> Those lines read as A:"Identities = 124/135 (91%), Gaps = 2/135 (1%)"
> or just B:"Identities = 124/135 (91%)". Both types coexist.
> I wrote the pattern matching script as
> "$judgecont=~m/Identities = (.*)\/.*?\((.*)%\)/;"
> My purpose is to grab $1 (the numerator of the fraction) and $2 (the
> percentage of Identities, not Gaps).
> However, because of greedy matching, I always get the "1" (percentage
> of gaps) in situation A.

You are still too greedy.  Here's what I get with your pattern:

% cat > input.dat
Identities = 124/135 (91%), Gaps = 2/135 (1%)
Identities = 124/135 (91%)\n
^D
% cat input.dat | perl -ne \
'm/Identities = (.*)\/.*?\((.*)%\)/; print "$1*$2*\n";'
124/135 (91%), Gaps = 2*1*
124*91*
%

Here's what's happening with the first line.
   (.*)   --- matches the "124/135 (91%), Gaps = 2"
   \/     --- matches the "/" in "2/135"
   .*?\(  --- matches the "135 ("
   (.*)   --- matches to the end, then backtracks so the %\) in the
          next step can match.  This leaves the "1" in "1%"
   %\)    --- matches the "%)" at the very end of the line.

The failure occurs because the first (.*) matches all the way
to the end of the line, then backtracks only until the rest of
the pattern can match.  That happens at the last "/", before it
has backtracked as far as you want it to go.
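
To see where the greedy group actually stops, here is a small sketch
that compares the end of capture group 1 with the positions of the first
and last "/" in the line.  The helper name group1_end is made up for
this example; @+ is Perl's built-in array of match end offsets.

```perl
use strict;
use warnings;

# Hypothetical helper: return the offset at which capture group 1 ends,
# or -1 if the pattern does not match.
sub group1_end {
    my ($pattern, $text) = @_;
    return $text =~ $pattern ? $+[1] : -1;
}

my $line = 'Identities = 124/135 (91%), Gaps = 2/135 (1%)';

# Greedy (.*) runs to the end of the line, then gives characters back
# until the following "/" can match, so group 1 ends at the LAST "/".
my $end = group1_end(qr{Identities = (.*)/}, $line);

printf "group 1 ends at offset %d\n", $end;
printf "first '/' is at offset %d\n", index($line, '/');
printf "last  '/' is at offset %d\n", rindex($line, '/');
```

The first and third numbers printed should agree, which is the
backtracking stopping one "/" too late.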

You want it to stop grabbing characters when it first reaches
the "/", which means you should make it non-greedy, as in

% cat input.dat | perl -ne \
'm/Identities = (.*?)\/.*?\((.*?)%\)/; print "$1*$2*\n";'
124*91*
124*91*
%

At least, that's my understanding of what you want.

Also, I've found it better to be explicit about what you want,
rather than depend on greediness, and don't put
too much faith in .* -- it'll bite back in strange ways.
Here's what I would have used

   /Identities = (\d+)\/\d+ \((\d+)%\)/
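
As a rough sketch, the digits-only approach could be wrapped in a small
script like the one below.  The sub name parse_identities is invented
for illustration, and the pattern is written with m{...} delimiters so
the "/" needs no backslash; otherwise it is the same idea of matching
\d+ instead of .*.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (numerator, percent) from an
# "Identities = n/m (p%)" line, or an empty list if the line does not match.
sub parse_identities {
    my ($line) = @_;
    # \d+ can only match digits, so it cannot run past the "/" or the "%".
    return $line =~ m{Identities = (\d+)/\d+ \((\d+)%\)};
}

for my $line (
    'Identities = 124/135 (91%), Gaps = 2/135 (1%)',
    'Identities = 124/135 (91%)',
) {
    my ($num, $pct) = parse_identities($line);
    print "$num*$pct*\n";
}
```

Because each group can only consume digits, neither the greedy nor the
non-greedy question comes up, and both input forms give the same result.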

					Andrew
					dalke at dalkescientific.com


