[EMBOSS] Emma options

Guy Bottu gbottu at ben.vub.ac.be
Fri Feb 17 11:27:11 UTC 2006


On Wed, Feb 15, 2006 at 08:41:02PM +0200, Scott Hazelhurst wrote:
> As an aside, does anyone know the format of what the matrix file
> should look like. I did some web searching and looked at the clustalw
> source code but it's not so easy to re-engineer..

In case you have not already found it yourself, here is the answer
(copied from a user manual I composed some time ago):

-------------------------------------------------------------
Data files

   CLUSTAL uses symbol comparison matrices for scoring bases or amino acids.
   CLUSTAL has built-in symbol comparison matrices, but allows you to provide
   your own matrix. For proteins, but not for nucleic acids, you can give a
   series of matrices as input. You can choose different matrices for pairwise
   alignment and for multiple alignment.

  Single matrix input file

   The format used for a single matrix is the same as that used by the BLAST
   program. The scores in the new weight matrix should be similarities. You can
   use negative as well as positive values if you wish, although for proteins
   the matrix will be automatically adjusted to all positive scores, unless the
   -norescale option is selected. Any lines beginning with a # character are
   assumed to be comments. The first non-comment line should contain a list of
   bases or amino acids in any order, using the 1 letter code, followed by a *
   character. This should be followed by a square matrix of scores, with one
   row and one column for each base or amino acid. The last row and column of
   the matrix (corresponding to the * character) contain the minimum score over
   the whole matrix.

#  Matrix made by matblas from blosum62.iij
#  * column uses minimum score
#  BLOSUM Clustered Scoring Matrix in 1/2 Bit Units
#  Blocks Database = /data/blocks_5.0/blocks.dat
#  Cluster Percentage: >= 62
#  Entropy =   0.6979, Expected =  -0.5209
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  Z  X  *
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0 -2 -1  0 -4
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3 -1  0 -1 -4
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3  3  0 -1 -4
D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3  4  1 -1 -4
C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 -4
Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2  0  3 -1 -4
E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3 -1 -2 -1 -4
H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3  0  0 -1 -4
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3 -3 -3 -1 -4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1 -4 -3 -1 -4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2  0  1 -1 -4
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1 -3 -1 -1 -4
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1 -3 -3 -1 -4
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2 -2 -1 -2 -4
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2  0  0  0 -4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0 -1 -1  0 -4
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3 -4 -3 -2 -4
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1 -3 -2 -1 -4
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4 -3 -2 -1 -4
B -2 -1  3  4 -3  0  1 -1  0 -3 -4  0 -3 -3 -2  0 -1 -4 -3 -3  4  1 -1 -4
Z -1  0  0  1 -3  3  4 -2  0 -3 -3  1 -1 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
X  0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1 -1 -1 -4
* -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4  1
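
   The rules above can be turned into a small reader. What follows is only a
   minimal sketch in Python, not CLUSTAL's own parser: the function name
   parse_matrix and the toy matrix are made up for illustration, and it
   assumes that every row starts with its symbol (as in the example above)
   and that all scores are integers.

```python
def parse_matrix(text):
    """Parse a BLAST-format matrix into (symbols, scores).

    scores maps (row_symbol, column_symbol) to an integer score.
    Hypothetical helper, not part of CLUSTAL or EMBOSS.
    """
    symbols = None
    scores = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Blank lines and lines beginning with '#' carry no data.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if symbols is None:
            # First non-comment line: the list of symbols, ending with '*'.
            symbols = fields
            continue
        # Assumption: each row begins with its own symbol.
        row, values = fields[0], [int(v) for v in fields[1:]]
        if len(values) != len(symbols):
            raise ValueError("row %r has %d scores, expected %d"
                             % (row, len(values), len(symbols)))
        for col, value in zip(symbols, values):
            scores[(row, col)] = value
    if symbols is None:
        raise ValueError("no symbol line found")
    # The matrix must be square: one row per symbol in the header.
    missing = [s for s in symbols if (s, symbols[0]) not in scores]
    if missing:
        raise ValueError("missing rows for: " + " ".join(missing))
    return symbols, scores


# Toy two-letter matrix, invented for this example only.
SAMPLE = """\
# toy matrix, not a real scoring scheme
   A  C  *
A  2 -1 -1
C -1  3 -1
* -1 -1  1
"""

symbols, scores = parse_matrix(SAMPLE)
print(symbols)
print(scores[("A", "C")])
```

   Keeping the scores in a dictionary keyed by symbol pairs means the order
   of the symbols in the header line does not matter, which matches the "in
   any order" rule above.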

  Matrix series input format

   For proteins, CLUSTAL by default uses different matrices depending on the
   mean percent identity of the sequences to be aligned. For proteins, but not
   for nucleic acids, you can specify your own series of matrices and the
   range of percent identity for each matrix in a matrix series file. The
   file is automatically recognised by the word CLUSTAL_SERIES at the beginning
   of the file. Each matrix in the series is then specified on one line which
   should start with the word MATRIX. This is followed by the lower and upper
   limits of the sequence percent identities for which you want to apply the
   matrix. The final entry on the matrix line is the filename of a BLAST format
   matrix file (see above for details of the single matrix file format).

CLUSTAL_SERIES

MATRIX 81 100 blosum80
MATRIX 61  80 blosum62
MATRIX 31  60 blosum45
MATRIX  0  30 blosum30
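
   To illustrate how such a series file maps a percent identity to a matrix,
   here is a minimal sketch in Python. It is not how CLUSTAL itself reads the
   file: the names parse_series and pick_matrix are hypothetical, the limits
   are assumed to be integers and inclusive at both ends, and lines that do
   not start with MATRIX are simply skipped.

```python
def parse_series(text):
    """Return a list of (low, high, filename) from a matrix series file.

    Hypothetical helper, not part of CLUSTAL or EMBOSS.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    # The file is recognised by the word CLUSTAL_SERIES at its beginning.
    if not lines or not lines[0].startswith("CLUSTAL_SERIES"):
        raise ValueError("not a matrix series file")
    entries = []
    for ln in lines[1:]:
        fields = ln.split()
        if fields[0] != "MATRIX":
            continue  # assumption: ignore anything that is not a MATRIX line
        if len(fields) != 4:
            raise ValueError("malformed MATRIX line: %r" % ln)
        entries.append((int(fields[1]), int(fields[2]), fields[3]))
    return entries


def pick_matrix(entries, percent_identity):
    """Return the filename whose range contains percent_identity, or None."""
    for low, high, filename in entries:
        # Assumption: both limits are inclusive.
        if low <= percent_identity <= high:
            return filename
    return None


SERIES = """\
CLUSTAL_SERIES

MATRIX 81 100 blosum80
MATRIX 61  80 blosum62
MATRIX 31  60 blosum45
MATRIX  0  30 blosum30
"""

entries = parse_series(SERIES)
print(pick_matrix(entries, 70))
```

   With integer limits like these, a fractional identity such as 80.5 falls
   between two ranges, and this sketch returns None for it. How CLUSTAL
   treats that case is not covered by the description above.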

----------------------------------------------------------------------

	Regards,
	Guy Bottu,
	Belgian EMBnet Node



