[Bioperl-l] Re: Another update for my alignment module

Aaron J Mackey ajm6q at virginia.edu
Wed May 21 09:51:17 EDT 2003


Hi Yee, overall this looks fine (although it duplicates some of what's
already in Bio::Ext::Align).  Can you imagine integrating your stuff with
what's already in Bio::Tools::pSW (or better, having your own
Bio::Tools::dpAlign, that mimics the behavior of pSW)?  I see your
$factory->pairwise_alignment($seq1, $seq2) already does look like pSW, so
that's great.

I'd love to see the first thing on your TODO list, combined with a
Bio::Matrix::SimilarityScore matrix object (wasn't someone working on
that?).

What I think would also be a spectacular item for your TODO list is to
calculate, for a given match/mismatch or scoring matrix, the expected
average score per residue, and to throw a warning of it's > 0 (and thus
not likely to generate local alignments).  This bites many new users
playing with DP algorithms and scoring parameters.

> 	It is accepted
> 	that Miller-Myers implementation is the fastest and using the
> 	least memory that is truly equivalent to original algorithm
> 	introduced by Needleman-Wunsch.

I think that the developers of vector and other parallel-processed
DP algorithms would disagree with it being the fastest.

>       Phil Green (?? yr) introduced
> 	heuristics to skip the calculation of some cells.

I would restate this as something like "Phil Green's SWAT implementation
of the Smith-Waterman algorithm introduced a heuristic that does not
consider paths through the matrix where the score would be less than the
gap open penalty, yielding a 1.5-2X speedup on most comparisons".


>       However,
> 	his approach is only good for calculating the minimum edit
> 	distance and find out the corresponding subsequences (aka
> 	search phase).

As I said before, this simply isn't true.  The Miller-Myers divide and
conquer still must use some scoring scheme to calculate the "pivot point"
where the alignment path crosses the joining row.  There is nothing
stopping you from using the Phil Green SWAT optimization/heuristic to
calculate those scores and paths.  The reason Bill Pearson's ssearch
doesn't use this optimization during the alignment phase is that speed is
not critical during the alignment phase, only the search phase.

>       The most popular dynamic programming alignment
> 	program ssearch uses Phil Green's algorithm to find the
> 	subsequences and then Miller-Myers's algorithm to find the
> 	actual alignment. (aka alignment phase)

Since you've already done a very good job of giving credit where it's due,
I'd say "Bill Pearson's popular DP alignment program SSEARCH uses ... "

> 	It allows you
> 	to specify either the Phil Green (DPALIGN_LOCAL_GREEN)
> 	or Miller-Myers (DPALIGN_LOCAL_MILLER_MYERS).

You've already said that the Phil Green optimized SW can't be used to
generate an alignment (which is untrue for the reason's I've already
given) - what does DPALIGN_LOCAL_GREEN do then?

> 	5) For DNA sequences, provides an option to run reverse
> 	complement search.

Let the user do this by running pairwise_alignment($seq1, $seq2->revcomp);

> 	6) Support six frames alignment between a DNA sequence and
> 	a protein sequence.

Again, it's almost better to let the user do this themselves in bioperl
space (keep your code and interface simpler):

for $frame (0 .. 2) {
  push @alns,
    $factory->pairwise_alignment($dna->translate(undef, undef, $frame), $prot);
}

$dna = $dna->revcomp;
# and repeat above for loop.

As a side note, you need to make sure your similarity matrix has scores
for "B", "Z", "X" and "*", if it doesn't already.

-Aaron



More information about the Bioperl-l mailing list