Bioperl: prosite patterns to SeqPattern

Andrew Dalke dalke@bioreason.com
Wed, 02 Jun 1999 17:20:26 -0600



Attached is a function to convert Prosite patterns (from the PA
line(s)) into regular expressions, as needed by SeqPattern or for
a normal regular expression search.
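
For the plain regular expression case, a minimal sketch of searching a
sequence with one converted pattern could look like the following. The
regex is the expected result for the first test case in the attachment;
the sequence is made up for illustration and is not real data.

```perl
#!/usr/bin/perl -w
use strict;

# Regex expected from the Prosite pattern "[AC]-x-V-x(4)-{ED}."
my $regex = '[AC].V.{4}[^ED]';

# Hypothetical sequence, invented for this example.
my $seq = 'MKAGVLLLLKQ';

# Collect the 0-based start offset of every non-overlapping match.
my @starts;
while ($seq =~ /$regex/g) {
    push @starts, $-[0];    # $-[0] is the offset where the last match began
}
print "match at offset $_\n" for @starts;
```

With this sequence the only match starts at offset 2 (AGVLLLLK).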

I haven't had much chance to regression test it.  The best test would
be to scan Prosite and cross-check against the Swiss-Prot results,
but I don't have those data files handy for the comparison.
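
As a starting point for such a scan, here is a rough sketch of pulling
the PA lines out of a Prosite-style data file. It assumes Swiss-Prot-like
records (a two-letter line code on each line, entries closed by "//") and
that a long pattern simply continues on the next PA line. The name
read_prosite_patterns and the two sample entries are hypothetical; the
patterns themselves are borrowed from the test cases in the attachment.

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical helper: return a hash of entry ID => Prosite pattern.
# Assumes each line starts with a two-letter code and "//" ends an entry.
sub read_prosite_patterns {
    my ($fh) = @_;
    my %patterns;
    my ($id, $pa) = (undef, '');
    while (my $line = <$fh>) {
        if ($line =~ /^ID\s+([^;\s]+)/) {
            $id = $1;
        } elsif ($line =~ /^PA\s+(\S+)/) {
            $pa .= $1;                # a pattern may span several PA lines
        } elsif ($line =~ m{^//}) {
            $patterns{$id} = $pa if defined $id && $pa ne '';
            ($id, $pa) = (undef, '');
        }
    }
    return %patterns;
}

# Made-up sample entries, not copied from the real data file.
my $data = <<'END';
ID   EXAMPLE_1; PATTERN.
PA   [AC]-x-V-x(4)-{ED}.
//
ID   EXAMPLE_2; PATTERN.
PA   [LIV]-K-x(2)-[LIV]-x(2)-L-I-[DEQ]-[KRHNQ]-
PA   x-Y-[LIVM]-x-R-x(6,7)-[FY]-x-Y-x-[SA]>.
//
END

# Read the sample through an in-memory filehandle.
open(my $fh, '<', \$data) or die "cannot read sample data: $!";
my %patterns = read_prosite_patterns($fh);
close($fh);

print "$_: $patterns{$_}\n" for sort keys %patterns;
```

Each extracted pattern could then be passed to the conversion function
and the resulting regex run over the sequence database.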

Anyone here want to eyeball that I did the conversion right?
The Prosite pattern format is given in
http://www.expasy.ch/txt/prosuser.txt .   My code only converts
valid patterns and does not attempt to identify invalid patterns,
like "C-<-A".
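
If a syntax check were wanted before converting, one possible sketch is
below. It is not part of the attached code. The grammar is my reading of
the conventions in prosuser.txt (elements joined by "-", an optional
repeat count, optional "<" and ">" at the ends, a closing period), and
is_valid_prosite is a hypothetical name. It accepts any capital letter
as a residue code, which is looser than the real amino acid alphabet.

```perl
#!/usr/bin/perl -w
use strict;

# One pattern element: a residue, x, [..] or {..}, optionally followed
# by a repeat count such as (3) or (2,4).
my $element = qr/(?:[A-Z]|x|\[[A-Z]+\]|\{[A-Z]+\})(?:\(\d+(?:,\d+)?\))?/;

# Whole pattern: optional <, elements joined by -, optional >, then a period.
my $pattern = qr/^<?$element(?:-$element)*>?\.\z/;

# Hypothetical helper: 1 if the string fits the assumed grammar, else 0.
sub is_valid_prosite {
    my ($prosite) = @_;
    return $prosite =~ $pattern ? 1 : 0;
}

for my $p ('[AC]-x-V-x(4)-{ED}.', '<A-x-[ST](2)-x(0,1)-V.', 'C-<-A', 'C-<-A.') {
    printf "%-24s %s\n", $p, is_valid_prosite($p) ? 'ok' : 'INVALID';
}
```

Here "C-<-A" is rejected both with and without the closing period,
because "<" is only allowed at the very start of the pattern.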

Feel free to modify as needed to fit into the bioperl distribution.

						Andrew
						dalke@bioreason.com

[attachment: prosite.pl]

#!/usr/local/bin/perl -w
$DOC = <<EOM;

The Prosite patterns are defined at http://www.expasy.ch/txt/prosuser.txt

The PA (PAttern) lines contain the definition of a PROSITE pattern. The
   patterns are described using the following conventions:

   -  The standard IUPAC one-letter codes for the amino acids are used.
   -  The symbol `x' is used for a position where any amino acid is accepted.
   -  Ambiguities are indicated by listing the acceptable amino acids for a
      given position between square brackets `[ ]'. For example: [ALT]
      stands for Ala or Leu or Thr.
   -  Ambiguities are also indicated by listing between a pair of curly
      brackets `{ }' the amino acids that are not accepted at a given
      position. For example: {AM} stands for any amino acid except Ala and
      Met.
   -  Each element in a pattern is separated from its neighbor by a `-'.
   -  Repetition of an element of the pattern can be indicated by following
      that element with a numerical value or a numerical range between
      parentheses. Examples: x(3) corresponds to x-x-x, x(2,4) corresponds to
      x-x or x-x-x or x-x-x-x.
   -  When a pattern is restricted to either the N- or C-terminal of a
      sequence, that pattern either starts with a `<' symbol or respectively
      ends with a `>' symbol.
   -  A period ends the pattern.

That boils down to doing these conversions

[] -> []
{} -> [^]
-  -> (removed)
() -> {}
<  -> ^
>  -> \$
x  -> .
.  -> (removed)

EOM
# Referencing $DOC again silences the "used only once" warning under -w.
# (A POD block, =pod ... =cut, is the usual Perl way to write a long comment.)
$DOC = $DOC;

%convert = (
	    #'[' => '[',
	    #']' => ']',
	    '{' => '[^',
	    '}' => ']',
	    '-' => '',
	    '(' => '{',
	    ')' => '}',
	    '<' => '^',
	    '>' => '$',
	    '.' => '',
	    'x' => '.'
	   );

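# Convert a Prosite pattern (the text of the PA line(s)) into a Perl
# regular expression.  Only valid patterns are handled; no syntax
# checking is done.
sub prosite_to_seqpattern {
  my($prosite) = @_;
  my $pat = '';             # start empty so "" converts to "", not undef
  foreach my $ch (split(//, $prosite)) {
    my $new_ch = $convert{$ch};
    if (defined $new_ch) {  # If I know how to translate a character,
      $pat .= $new_ch;      #  use the converted character
    } else {
      $pat .= uc($ch);      # else stick with the original
    }
  }
  return $pat;
}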

# some test code

sub compare_patterns {
  my($expected, $got) = @_;   # not $a and $b, which sort() treats specially
  if ($expected ne $got) {
    print "Got $got but should have gotten $expected\n";
  }
}

&compare_patterns("[AC].V.{4}[^ED]",
		  &prosite_to_seqpattern("[AC]-x-V-x(4)-{ED}."));

&compare_patterns("^A.[ST]{2}.{0,1}V",
		  &prosite_to_seqpattern("<A-x-[ST](2)-x(0,1)-V."));

&compare_patterns('[LIV]K.{2}[LIV].{2}LI[DEQ][KRHNQ].Y[LIVM].R.{6,7}[FY].Y.[SA]$',
		  &prosite_to_seqpattern(
		     "[LIV]-K-x(2)-[LIV]-x(2)-L-I-[DEQ]-[KRHNQ]".
		     "-x-Y-[LIVM]-x-R-x(6,7)-[FY]-x-Y-x-[SA]>."));

