Bioperl: expert at reg. expressions: some patterns, thanks

Andrew Dalke dalke@bioreason.com
Thu, 08 Oct 1998 17:53:15 -0700


:  1.  pattern:
: >how to find QAQAQAQAQAQA in a protein sequences -- it's like finding
: iteration of "QA", but
: >can I make a regular expression that doesn't need a motif like "QA"
: >specified?
: 
: offered solution
: 
: Try  /(..){2,}/  or  /(..)$1+/
: 
: $1 will tell you what the dipeptide was. length($&)/2 will tell you the
: number of copies.

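Neither does what I think you want.  The first matches any run of two
or more dipeptides, not repeats of the same one, and in the second
"$1" is interpolated as an (empty) variable before the match starts,
so it behaves like /(..)+/: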

bioreason8> perl -e '$_ = "QAQAQAAAA"; /(..){2,}/; print "$1 $&\n"'
AA QAQAQAAA
bioreason8> perl -e '$_ = "QAQAQAAAAAAAA"; /(..){2,}/; print "$1 $&\n"'
AA QAQAQAAAAAAA
bioreason8> perl -e '$_ = "QAQAQAAAAAAAA"; /(..)$1+/; print "$1 $&\n"'
AA QAQAQAAAAAAA

Use the backreference "\1" instead to get:
bioreason8> perl -e '$_ = "QAQAQAAAAAAAA"; /(..)\1+/; print "$1 $&\n"'
QA QAQAQA

But that's arguably incorrect since the longest dipeptide repeat in
that sequence is "(AA)(AA)(AA)(AA)".  Perl's re implementation is
designed to find the first (leftmost) match, not the longest match.
(POSIX regular expressions prefer the longest match, but only among
matches that start at the leftmost position, so I don't believe they
would find it either.)

You won't get the whole second group of dipeptides in this case
from a
   while (/(..)\1+/g) {
     print "$1 $&\n"
   }
loop either.  The first match, "QAQAQA", ends with the A that starts
the second repeat, and the while(//g) construct doesn't allow overlaps
between matches, so the second repeat is reported as "AAAAAA" instead
of all eight As.

You would have the same problem with "AQAQAQBQBQB", since the Q
in "AQB" belongs to both repeats: the second one comes back as "BQBQ"
rather than "QBQBQB".
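
One way around the overlap problem is to wrap the whole pattern in a
lookahead, so that each match consumes nothing and //g retries at every
offset.  The following is a minimal sketch of that idea, not something
from the original thread; the name longest_dipeptide_repeat is made up
for illustration, and it assumes you want the single longest run (ties
going to the leftmost one).

```perl
use strict;
use warnings;

# Return (start, unit, match) for the longest run of a repeated dipeptide,
# or an empty list if no dipeptide is repeated. Ties go to the leftmost run.
# (Hypothetical helper name, for illustration only.)
sub longest_dipeptide_repeat {
    my ($seq) = @_;
    my @best;
    # The lookahead consumes no characters, so //g retries at every
    # offset and overlapping runs are all considered.
    while ($seq =~ /(?=((..)\2+))/g) {
        my ($start, $match, $unit) = ($-[0], $1, $2);
        @best = ($start, $unit, $match)
            if !@best || length($match) > length($best[2]);
    }
    return @best;
}

my ($start, $unit, $match) = longest_dipeptide_repeat("QAQAQAAAAAAAA");
print "$unit $match at offset $start\n";   # AA AAAAAAAA at offset 5
```

The price is that every offset inside a run also produces a (shorter)
hit, so you have to decide which hits to keep; here only the longest
survives.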

| 2.  pattern:
| 
| I understand (R|H){6,} finds all combinations of tracts of R and H
| of lengths 6 or greater.  But if I want only "combination" tracts
| that are made of a combination of BOTH R and H, how do I write an
| RE to exclude tracts of ONLY R (R)n  and ONLY H (H)n.

The way to find any tract that contains at least one R and at least
one H is:
  [RH]*(RH|HR)[RH]*
but this will find tracts of length 2 or more.  What you want
is the intersection of

  /[RH]{6,}/ and /[RH]*(RH|HR)[RH]*/

which is itself a regular expression (regular languages are closed
under intersection), but as far as I am aware it is not easily
expressible in Perl other than by explicit enumeration, as in
the union of:

  R{1}H[RH]{4,}
  R{2}H[RH]{3,}
  R{3}H[RH]{2,}
  R{4}H[RH]{1,}
  R{5}H[RH]{0,}
  R{6}R*H[HR]*
  H{1}R[RH]{4,}
  H{2}R[RH]{3,}
  H{3}R[RH]{2,}
  H{4}R[RH]{1,}
  H{5}R[RH]{0,}
  H{6}H*R[HR]*

which can be reduced to
 ((R{1}H|H{1}R)[RH]{4,}) |
 ((R{2}H|H{2}R)[RH]{3,}) |
 ((R{3}H|H{3}R)[RH]{2,}) |
 ((R{4}H|H{4}R)[RH]{1,}) |
 ((R{5}H|H{5}R)[RH]{0,}) |
 ((R{6}R*H)|(H{6}H*R))[HR]*

For a given N, this can easily be generated algorithmically.
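
As a rough sketch of that generation step (the helper name
mixed_tract_pattern is hypothetical, and non-capturing groups are used
in place of the plain parentheses above), the alternatives can be built
in a loop over the length of the leading run:

```perl
use strict;
use warnings;

# Build the enumerated pattern for tracts of R/H of length >= $n that
# contain both residues. (Hypothetical helper name, for illustration.)
sub mixed_tract_pattern {
    my ($n) = @_;
    my @alts;
    for my $k (1 .. $n - 1) {
        my $rest = $n - $k - 1;   # residues still needed after the switch
        push @alts, "(?:R{$k}H|H{$k}R)[RH]{$rest,}";
    }
    # A leading run of $n or more identical residues, then the other one.
    push @alts, "(?:R{$n,}H|H{$n,}R)[RH]*";
    return join '|', @alts;
}

my $pattern = mixed_tract_pattern(6);
my $re      = qr/$pattern/;
for my $seq ("AARRRHHRA", "RRRRRRRR", "RHRHR") {
    print "$seq: ", ($seq =~ /($re)/ ? $1 : "no match"), "\n";
}
```

For N = 6 this yields the same six alternatives as the reduced form
above, with R{6,}H standing in for R{6}R*H.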

If you don't want it as a re, you can do tricks like:

  /[RH]{6,}/ && $& =~ /[RH]*(RH|HR)[RH]*/

Or to get information about where in the sequence it matched
(the following is UNTESTED):
  ($text) = /([RH]{6,})/;
  ($beforelen, $afterlen) = (length($`), length($'));
  if ($text =~ /HR|RH/) {  # Note simplification since I know the
                           # match only has Hs and Rs
     # do something here
  }
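
A slightly fuller sketch of the same two-step idea, scanning every
[RH] tract in the sequence rather than only the first.  The function
name mixed_tracts and the test sequence are made up for illustration;
@- (match offsets) needs a reasonably recent Perl.

```perl
use strict;
use warnings;

# List every tract of at least $n residues drawn from R and H that
# contains both, as [offset, tract] pairs.
# (Hypothetical helper name, for illustration only.)
sub mixed_tracts {
    my ($seq, $n) = @_;
    my @found;
    while ($seq =~ /([RH]{$n,})/g) {
        # Save these first: the inner match below resets $1 and @-.
        my ($start, $tract) = ($-[1], $1);
        push @found, [$start, $tract] if $tract =~ /RH|HR/;
    }
    return @found;
}

for my $hit (mixed_tracts("RRRRRRRAHRHRHRHAHHHHHH", 6)) {
    my ($start, $tract) = @$hit;
    print "$tract at offset $start\n";   # HRHRHRH at offset 8
}
```

Because [RH]{$n,} is greedy, each hit is a whole tract, so tracts
never overlap and the //g loop is safe here.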

: 3. can I find a tract of Q (of minimum length N) followed by no more
: than X amino acids before another tract of Q (of minimum length N)
: is found again?
: For example, to find:
: 
: AGTWRWDFDQQQQQQQQFAFCRCFCFAFAFCRFQQQQQQQQQQQQQ

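   /(Q{$n,})([^Q]{1,$x})(Q{$n,})/

(Note the {1,$x}: a quantifier of {$x,} would mean "at least $x", not
"no more than $x".  The gap in your example is 16 residues long, so $x
has to be at least 16 for it to match.)

bioreason8> perl -e '$n=5; $x=20; \
    $_="AGTWRWDFDQQQQQQQQFAFCRCFCFAFAFCRFQQQQQQQQQQQQQ"; \
    print "|$1|$2|$3|\n" if /(Q{$n,})([^Q]{1,$x})(Q{$n,})/'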
|QQQQQQQQ|FAFCRCFCFAFAFCRF|QQQQQQQQQQQQQ|


Same caveats as with the first one -- you'll need to be careful
about
   QQQXXXQQQXXXQQQ
because there are two equally "interesting" regions in that sequence.
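
One possible way to report both regions is to look at the second Q
tract without consuming it, so it can also serve as the first tract of
the next pair.  This is only a sketch: the name q_tract_pairs is made
up, and it assumes "no more than X" means a gap of 1 to $x non-Q
residues, written {1,$x}.

```perl
use strict;
use warnings;

# Find every pair of Q tracts (each at least $n long) separated by 1 to
# $x non-Q residues, as [offset, first, gap, second] entries.
# (Hypothetical helper name, for illustration only.)
sub q_tract_pairs {
    my ($seq, $n, $x) = @_;
    my @pairs;
    # The second tract sits inside a lookahead, so the next //g
    # iteration starts matching at that tract instead of after it.
    while ($seq =~ /(Q{$n,})([^Q]{1,$x})(?=(Q{$n,}))/g) {
        push @pairs, [$-[1], $1, $2, $3];
    }
    return @pairs;
}

for my $pair (q_tract_pairs("QQQXXXQQQXXXQQQ", 3, 3)) {
    my ($start, $first, $gap, $second) = @$pair;
    print "offset $start: |$first|$gap|$second|\n";
}
```

On "QQQXXXQQQXXXQQQ" with $n = 3 and $x = 3 this reports two pairs,
starting at offsets 0 and 6.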


> 4. how do I find tracts of an identical amino acid that are flanked at
> either end with the same amino acid...
> Good at: HTTTTTTTTTTH or  TGGGGGGGGGGGT

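  /(.)((?!\1).)\2+\1/
should work.  The obvious-looking /(.)[^\1]+\1/ is a trap: inside []
"\1" is not a backreference but an octal escape for the character
with code 1, so [^\1] matches nearly anything and the pattern would
also accept "HTQH".  The negative lookahead (?!\1) is what says "not
the flanking residue".

bioreason8> perl -ne 'print "OKAY|$1|\n" if /((.)((?!\2).)\3+\2)/'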
HTTH
OKAY|HTTH|
HTQ
TGGGGGGGGGGGT
OKAY|TGGGGGGGGGGGT|
HTTTTTTTTTTH
OKAY|HTTTTTTTTTTH|
HTTTHRRRRR
OKAY|HTTTH|

Again, same warnings.  Consider what you would find with "HTTTHHHT".
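
A small sketch of what an overlap-tolerant version might look like.
It is not from the original thread: flanked_tracts is a made-up name,
and it expresses "not the flanking residue" with a negative lookahead,
(?!\2), because inside [] a "\2" is an octal escape rather than a
backreference.

```perl
use strict;
use warnings;

# List every tract of two or more identical residues flanked on both
# sides by one other residue, as [offset, match] pairs, overlaps included.
# (Hypothetical helper name, for illustration only.)
sub flanked_tracts {
    my ($seq) = @_;
    my @hits;
    # \2 is the flank; (?!\2). is one residue that differs from it; \3+
    # repeats that residue. The outer lookahead keeps matches from
    # consuming text, so a closing flank can open the next hit.
    while ($seq =~ /(?=((.)((?!\2).)\3+\2))/g) {
        push @hits, [$-[0], $1];
    }
    return @hits;
}

print "$_->[1] at offset $_->[0]\n" for flanked_tracts("HTTTHHHT");
```

For "HTTTHHHT" this reports both HTTTH (offset 0) and THHHT (offset 3),
which share the T at offset 3.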

						Andrew
						dalke@bioreason.com
=========== Bioperl Project Mailing List Message Footer =======
Project URL: http://bio.perl.org/
For info about how to (un)subscribe, where messages are archived, etc:
http://www.techfak.uni-bielefeld.de/bcd/Perl/Bio/vsns-bcd-perl.html
====================================================================