Bioperl: prosite patterns to SeqPattern
Andrew Dalke
dalke@bioreason.com
Wed, 02 Jun 1999 17:20:26 -0600
Attached is a function to convert Prosite patterns (from the PA
line(s)) into regular expressions, as needed by SeqPattern or for
a normal regular expression search.
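For the plain regular expression case, here is a minimal sketch of how the converted pattern might be used. It repeats the conversion table from the attachment so that it runs on its own. The helper find_motif is a hypothetical name made up for illustration; it is not part of prosite.pl or SeqPattern.

```perl
#!/usr/bin/perl -w
use strict;

# Same conversion table as the attached prosite.pl.
my %convert = (
    '{' => '[^', '}' => ']', '-' => '', '(' => '{', ')' => '}',
    '<' => '^',  '>' => '$', '.' => '', 'x' => '.',
);

# Character-by-character conversion, as in the attachment.
sub prosite_to_seqpattern {
    my ($prosite) = @_;
    my $pat = '';
    foreach my $ch (split(//, $prosite)) {
        $pat .= defined $convert{$ch} ? $convert{$ch} : uc($ch);
    }
    return $pat;
}

# Hypothetical helper (an assumption, not in the original code):
# return the first substring of $seq that matches the Prosite pattern,
# or undef if nothing matches.
sub find_motif {
    my ($seq, $prosite) = @_;
    my $regex = prosite_to_seqpattern($prosite);
    return $seq =~ /($regex)/ ? $1 : undef;
}
```

With an invented sequence, find_motif("MKACGVLLLLQ", "[AC]-x-V-x(4)-{ED}.") should return "CGVLLLLQ", since the pattern converts to [AC].V.{4}[^ED].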
I haven't had much chance to regression test it. The best test would
be to scan Prosite and cross-check the matches against the SWISS-PROT
results, but I don't have those data files handy to do the comparison.
Could anyone here eyeball the code and confirm I did the conversion right?
The Prosite pattern format is given in
http://www.expasy.ch/txt/prosuser.txt . My code only converts
valid patterns and does not attempt to identify invalid patterns,
like "C-<-A".
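Because the converter trusts its input, a separate check could reject malformed patterns before conversion. The following is only a sketch built from the conventions quoted in the attachment; looks_like_prosite is a hypothetical name, and the grammar it accepts is an assumption rather than an official definition of the format.

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical validity check, not part of the attached prosite.pl.
# One element is x, a single residue letter, a [..] set or a {..} set,
# optionally followed by a repeat count (n) or range (n,m).
# Note: [A-Z] is looser than the real amino acid alphabet.
my $element = qr/(?:x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\(\d+(?:,\d+)?\))?/;

sub looks_like_prosite {
    my ($prosite) = @_;
    # Optional leading '<', elements joined by '-', optional trailing '>',
    # and an optional final period.
    return $prosite =~ /^<?$element(?:-$element)*>?\.?$/ ? 1 : 0;
}
```

Under these assumptions, looks_like_prosite("C-<-A") returns 0 because `<` is only accepted at the very start, while the three test patterns in the attachment are accepted.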
Feel free to modify as needed to fit into the bioperl distribution.
Andrew
dalke@bioreason.com
Attachment: prosite.pl
#!/usr/local/bin/perl -w
$DOC = <<EOM;
The Prosite patterns are defined at http://www.expasy.ch/txt/prosuser.txt
The PA (PAttern) lines contain the definition of a PROSITE pattern. The
patterns are described using the following conventions:
- The standard IUPAC one-letter codes for the amino acids are used.
- The symbol `x' is used for a position where any amino acid is accepted.
- Ambiguities are indicated by listing the acceptable amino acids for a
given position, between square parentheses `[ ]'. For example: [ALT]
stands for Ala or Leu or Thr.
- Ambiguities are also indicated by listing between a pair of curly
brackets `{ }' the amino acids that are not accepted at a given
position. For example: {AM} stands for any amino acid except Ala and
Met.
- Each element in a pattern is separated from its neighbor by a `-'.
- Repetition of an element of the pattern can be indicated by following
that element with a numerical value or a numerical range between
parentheses. Examples: x(3) corresponds to x-x-x, x(2,4) corresponds to
x-x or x-x-x or x-x-x-x.
- When a pattern is restricted to either the N- or C-terminal of a
sequence, that pattern either starts with a `<' symbol or respectively
ends with a `>' symbol.
- A period ends the pattern.
That boils down to doing these conversions:
  []  ->  []
  {}  ->  [^]
  -   ->  (nothing)
  ()  ->  {}
  <   ->  ^
  >   ->  \$
  x   ->  .
  .   ->  (nothing)
EOM
$DOC = $DOC; # prevents warnings; don't know the perl way for long comments
%convert = (
    #'[' => '[',
    #']' => ']',
    '{' => '[^',
    '}' => ']',
    '-' => '',
    '(' => '{',
    ')' => '}',
    '<' => '^',
    '>' => '$',
    '.' => '',
    'x' => '.'
);
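# Convert one Prosite pattern (the text of its PA line(s)) into a
# regular expression string. Assumes the pattern is valid.
sub prosite_to_seqpattern {
    my($prosite) = $_[0];
    my($ch, $new_ch);
    my $pat = "";  # start empty so an empty pattern gives "" rather than undef
    foreach $ch (split(//, $prosite)) {
        $new_ch = $convert{$ch};
        if (defined $new_ch) {  # If I know how to translate a character,
            $pat .= $new_ch;    # use the converted character
        } else {
            $pat .= uc($ch);    # else stick with the original
        }
    }
    return $pat;
}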
# some test code
sub compare_patterns {
    my($expected, $got) = @_;  # not $a/$b, which sort() treats specially
    if ($expected ne $got) {
        print "Got $got but should have gotten $expected\n";
    }
}
&compare_patterns("[AC].V.{4}[^ED]",
                  &prosite_to_seqpattern("[AC]-x-V-x(4)-{ED}."));
&compare_patterns("^A.[ST]{2}.{0,1}V",
                  &prosite_to_seqpattern("<A-x-[ST](2)-x(0,1)-V."));
&compare_patterns('[LIV]K.{2}[LIV].{2}LI[DEQ][KRHNQ].Y[LIVM].R.{6,7}[FY].Y.[SA]$',
                  &prosite_to_seqpattern(
                      "[LIV]-K-x(2)-[LIV]-x(2)-L-I-[DEQ]-[KRHNQ]".
                      "-x-Y-[LIVM]-x-R-x(6,7)-[FY]-x-Y-x-[SA]>."));