[Biopython-dev] Method to weight sequences in an alignment

Mon Apr 9 18:25:04 UTC 2012

Folks,

I've written a function to weight sequences according to the simple scheme
used in PSI-BLAST [*]. It operates on Bio.Align.MultipleSeqAlignment
objects or lists of plain strings, and could be added as a method with
minimal changes (for Python 2.5 compatibility, mainly). Any interest in
adding it to Biopython?

The code is below.

Cheers,
Eric

[*] Henikoff & Henikoff (1994): Position-based sequence weights.
http://www.ncbi.nlm.nih.gov/pubmed/7966282

----

def sequence_weights(aln):
    """Weight aligned sequences to emphasize more divergent members.

    Returns a list of floating-point numbers between 0 and 1, corresponding
to
    the proportional weight of each sequence in the alignment. The first
list
    is the weight of the first sequence in the alignment, and so on. Weights
    sum to 1.0.

    Method: At each column position, award each different residue an equal
    share of the weight, and then divide that weight equally among the
    sequences sharing the same residue.  For each sequence, sum the
    contributions from each position to give a sequence weight.

    See Henikoff & Henikoff (1994): Position-based sequence weights.
    """
    def col_weight(column):
        """Represent the diversity at a position.

        Award each different residue an equal share of the weight, and then
        divide that weight equally among the sequences sharing the same
        residue.

        So, if in a position of a multiple alignment, r different residues
        are represented, a residue represented in only one sequence
contributes
        a score of 1/r to that sequence, whereas a residue represented in s
        sequences contributes a score of 1/rs to each of the s sequences.
        """
        # Skip columns with all gaps or unique inserts
        if len([c for c in column if c not in '-.']) < 2:
            return [0] * len(column)
        # Count the number of occurrences of each residue type
        # (Treat gaps as a separate, 21st character)
        counts = Counter(column)
        # Get residue weights: 1/rs, where
        # r = nb. residue types, s = count of a particular residue type
        n_residues = len(counts)    # r
        freqs = dict((aa, 1.0 / (n_residues * count))
                for aa, count in counts.iteritems())
        weights = [freqs[aa] for aa in column]
        return weights

    seq_weights = [0] * len(aln)
    col_weights = map(col_weight, zip(*aln))
    # Sum the contributions from each position along each sequence -> total
weight
    for col in col_weights:
        for idx, row_val in enumerate(col):
            seq_weights[idx] += row_val
    # Normalize
    scale = 1.0 / sum(seq_weights)
    seq_weights = [scale * wt for wt in seq_weights]
    return seq_weights