[Biopython-dev] Method to weight sequences in an alignment
eric.talevich at gmail.com
Mon Apr 9 18:25:04 UTC 2012
I've written a function to weight sequences according to the simple scheme
used in PSI-BLAST [*]. It operates on Bio.Align.MultipleSeqAlignment
objects or lists of plain strings, and could be added as a method with
minimal changes (for Python 2.5 compatibility, mainly). Any interest in
adding it to Biopython?
The code is below.
[*] Henikoff & Henikoff (1994): Position-based sequence weights.
"""Weight aligned sequences to emphasize more divergent members.
Returns a list of floating-point numbers between 0 and 1, corresponding
the proportional weight of each sequence in the alignment. The first
is the weight of the first sequence in the alignment, and so on. Weights
sum to 1.0.
Method: At each column position, award each different residue an equal
share of the weight, and then divide that weight equally among the
sequences sharing the same residue. For each sequence, sum the
contributions from each position to give a sequence weight.
See Henikoff & Henikoff (1994): Position-based sequence weights.
"""Represent the diversity at a position.
Award each different residue an equal share of the weight, and then
divide that weight equally among the sequences sharing the same
So, if in a position of a multiple alignment, r different residues
are represented, a residue represented in only one sequence
a score of 1/r to that sequence, whereas a residue represented in s
sequences contributes a score of 1/rs to each of the s sequences.
# Skip columns with all gaps or unique inserts
if len([c for c in column if c not in '-.']) < 2:
return  * len(column)
# Count the number of occurrences of each residue type
# (Treat gaps as a separate, 21st character)
counts = Counter(column)
# Get residue weights: 1/rs, where
# r = nb. residue types, s = count of a particular residue type
n_residues = len(counts) # r
freqs = dict((aa, 1.0 / (n_residues * count))
for aa, count in counts.iteritems())
weights = [freqs[aa] for aa in column]
seq_weights =  * len(aln)
col_weights = map(col_weight, zip(*aln))
# Sum the contributions from each position along each sequence -> total
for col in col_weights:
for idx, row_val in enumerate(col):
seq_weights[idx] += row_val
scale = 1.0 / sum(seq_weights)
seq_weights = [scale * wt for wt in seq_weights]
More information about the Biopython-dev