Bioperl: prosite patterns to SeqPattern

Andrew Dalke
Wed, 02 Jun 1999 17:20:26 -0600

Attached is a function to convert Prosite patterns (from the PA
line(s)) into regular expressions, as needed by SeqPattern or for
a normal regular expression search.

I haven't had much chance to regression test it.  The best would
be to scan prosite and cross check it with the swissprot results,
but I don't have those data files handy to do the cross compare.

Anyone here want to eyeball that I did the conversion right?
The Prosite pattern format is given in .   My code only converts
valid patterns and does not attempt to identify invalid patterns,
like "C-<-A".

Feel free to modify as needed to fit into the bioperl distribution.
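
As a quick illustration of the "normal regular expression search" use,
here is a minimal sketch. It restates the conversion table compactly so
that it runs on its own. The name prosite_to_regex, the example pattern,
and the example sequence are made up for illustration; none of them come
from PROSITE or the bioperl distribution.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Compact restatement of the conversion table from the attached script.
my %convert = (
    '{' => '[^', '}' => ']',
    '(' => '{',  ')' => '}',
    '<' => '^',  '>' => '$',
    '-' => '',   '.' => '',
    'x' => '.',
);

# Hypothetical helper name; converts one character at a time.
sub prosite_to_regex {
    my ($prosite) = @_;
    my $pat = '';
    foreach my $ch (split //, $prosite) {
        $pat .= exists $convert{$ch} ? $convert{$ch} : uc $ch;
    }
    return $pat;
}

# Illustrative pattern and sequence, both invented for this example.
my $prosite  = 'C-x(2,4)-[ST]-{P}-H.';
my $sequence = 'MKACAAGSLHEE';

my $regex = prosite_to_regex($prosite);
print "regex: $regex\n";

if ($sequence =~ /$regex/) {
    # @- and @+ hold the start and end offsets of the last match.
    my ($start, $end) = ($-[0], $+[0]);
    my $hit = substr($sequence, $start, $end - $start);
    print "match '$hit' at offset $start\n";
} else {
    print "no match\n";
}
```

With these inputs the pattern becomes C.{2,4}[ST][^P]H and the search
reports the hit CAAGSLH at offset 3.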

#!/usr/local/bin/perl -w
$DOC = <<EOM;

The Prosite patterns are defined at

The PA  (PAttern) lines  contain the  definition of a PROSITE pattern. The
   patterns are described using the following conventions:

   -  The standard IUPAC one-letter codes for the amino acids are used.
   -  The symbol `x' is used for a position where any amino acid is accepted.
   -  Ambiguities are  indicated by  listing the acceptable amino acids for a
      given position,  between square  parentheses `[  ]'. For example: [ALT]
      stands for Ala or Leu or Thr.
   -  Ambiguities are  also indicated  by listing  between a  pair  of  curly
      brackets `{  }' the  amino acids  that are  not  accepted  at  a  given
      position. For  example: {AM}  stands for  any amino acid except Ala and
      Met.
   -  Each element in a pattern is separated from its neighbor by a `-'.
   -  Repetition of  an element  of the pattern can be indicated by following
      that element  with a  numerical value  or  a  numerical  range  between
      parenthesis. Examples: x(3) corresponds to x-x-x, x(2,4) corresponds to
      x-x or x-x-x or x-x-x-x.
   -  When a  pattern is  restricted to  either the  N- or  C-terminal  of  a
      sequence, that  pattern either starts with a `<' symbol or respectively
      ends with a `>' symbol.
   -  A period ends the pattern.

That boils down to doing these conversions:

[] -> []
{} -> [^]
-  -> (nothing)
() -> {}
<  -> ^
>  -> \$
.  -> (nothing)
x  -> .
EOM

$DOC = $DOC;  # prevents warnings; don't know the perl way for long comments

%convert = (
            #'[' => '[',
            #']' => ']',
            '{' => '[^',
            '}' => ']',
            '-' => '',
            '(' => '{',
            ')' => '}',
            '<' => '^',
            '>' => '$',
            '.' => '',
            'x' => '.',
           );

sub prosite_to_seqpattern {
  my($prosite) = $_[0];
  my($pat, $ch, $new_ch);
  foreach $ch (split(//, $prosite)) {
    $new_ch = $convert{$ch};
    if (defined $new_ch) {  # If I know how to translate a character,
      $pat .= $new_ch;      #  use the converted character
    } else {
      $pat .= uc($ch);      # else keep the original, uppercased
    }
  }
  return $pat;
}

# some test code

sub compare_patterns {
  my($expected, $got) = @_;  # named to avoid shadowing sort's $a and $b
  if ($expected ne $got) {
    print "Got $got but should have gotten $expected\n";
  }
}
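
The calls that exercised compare_patterns are not shown in the script. A
minimal sketch of what such checks could look like follows. The expected
regular expressions were worked out by hand from the conversion table,
not taken from PROSITE or Swiss-Prot, and the converter is restated only
so the snippet runs standalone.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Restated from the script above so these checks run on their own.
my %convert = (
    '{' => '[^', '}' => ']',
    '(' => '{',  ')' => '}',
    '<' => '^',  '>' => '$',
    '-' => '',   '.' => '',
    'x' => '.',
);

sub prosite_to_seqpattern {
    my ($prosite) = @_;
    my $pat = '';
    foreach my $ch (split //, $prosite) {
        $pat .= exists $convert{$ch} ? $convert{$ch} : uc $ch;
    }
    return $pat;
}

# Returns 1 when the strings agree; otherwise prints a message and returns 0.
sub compare_patterns {
    my ($expected, $got) = @_;
    return 1 if $expected eq $got;
    print "Got $got but should have gotten $expected\n";
    return 0;
}

# Hand-derived cases (assumptions), roughly one per conversion rule.
my @cases = (
    [ 'x(3).'       => '.{3}'       ],
    [ 'x(2,4).'     => '.{2,4}'     ],
    [ '[ALT]-{AM}.' => '[ALT][^AM]' ],
    [ '<M-x(2)-C.'  => '^M.{2}C'    ],
    [ 'C-x-H>.'     => 'C.H$'       ],
);

my $failures = 0;
foreach my $case (@cases) {
    my ($prosite, $expected) = @$case;
    $failures++
        unless compare_patterns($expected, prosite_to_seqpattern($prosite));
}
print $failures ? "$failures check(s) failed\n" : "all checks passed\n";
```

Checks like these only confirm the character mapping. They are no
substitute for the cross-check against Swiss-Prot results mentioned in
the message.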




