[Bioperl-l] Re: Another update for my alignment module

Wed May 28 13:23:55 EDT 2003

Hi Aaron and Ewan,

	I implemented the function align_and_show by calling Ewan's 
write_pretty_seq_align. I believe now functionality-wise, my code is on
par of Ewan's code. I think in the next step, I will need to merge my
Align.xs and Ewan's Align.xs. How can I do that? My updated code is at 

http://www.stanford.edu/~yeeman/dpalign.tgz

Thanks
Yee Man

On Wed, 21 May 2003, Aaron J Mackey wrote:

> 
> Hi Yee, overall this looks fine (although it duplicates some of what's
> already in Bio::Ext::Align).  Can you imagine integrating your stuff with
> what's already in Bio::Tools::pSW (or better, having your own
> Bio::Tools::dpAlign, that mimics the behavior of pSW)?  I see your
> $factory->pairwise_alignment($seq1, $seq2) already does look like pSW, so
> that's great.
> 
> I'd love to see the first thing on your TODO list, combined with a
> Bio::Matrix::SimilarityScore matrix object (wasn't someone working on
> that?).
> 
> What I think would also be a spectacular item for your TODO list is to
> calculate, for a given match/mismatch or scoring matrix, the expected
> average score per residue, and to throw a warning of it's > 0 (and thus
> not likely to generate local alignments).  This bites many new users
> playing with DP algorithms and scoring parameters.
> 
> > 	It is accepted
> > 	that Miller-Myers implementation is the fastest and using the
> > 	least memory that is truly equivalent to original algorithm
> > 	introduced by Needleman-Wunsch.
> 
> I think that the developers of vector and other parallel-processed
> DP algorithms would disagree with it being the fastest.
> 
> >       Phil Green (?? yr) introduced
> > 	heuristics to skip the calculation of some cells.
> 
> I would restate this as something like "Phil Green's SWAT implementation
> of the Smith-Waterman algorithm introduced a heuristic that does not
> consider paths through the matrix where the score would be less than the
> gap open penalty, yielding a 1.5-2X speedup on most comparisons".
> 
> 
> >       However,
> > 	his approach is only good for calculating the minimum edit
> > 	distance and find out the corresponding subsequences (aka
> > 	search phase).
> 
> As I said before, this simply isn't true.  The Miller-Myers divide and
> conquer still must use some scoring scheme to calculate the "pivot point"
> where the alignment path crosses the joining row.  There is nothing
> stopping you from using the Phil Green SWAT optimization/heuristic to
> calculate those scores and paths.  The reason Bill Pearson's ssearch
> doesn't use this optimization during the alignment phase is that speed is
> not critical during the alignment phase, only the search phase.
> 
> >       The most popular dynamic programming alignment
> > 	program ssearch uses Phil Green's algorithm to find the
> > 	subsequences and then Miller-Myers's algorithm to find the
> > 	actual alignment. (aka alignment phase)
> 
> Since you've already done a very good job of giving credit where it's due,
> I'd say "Bill Pearson's popular DP alignment program SSEARCH uses ... "
> 
> > 	It allows you
> > 	to specify either the Phil Green (DPALIGN_LOCAL_GREEN)
> > 	or Miller-Myers (DPALIGN_LOCAL_MILLER_MYERS).
> 
> You've already said that the Phil Green optimized SW can't be used to
> generate an alignment (which is untrue for the reason's I've already
> given) - what does DPALIGN_LOCAL_GREEN do then?
> 
> > 	5) For DNA sequences, provides an option to run reverse
> > 	complement search.
> 
> Let the user do this by running pairwise_alignment($seq1, $seq2->revcomp);
> 
> > 	6) Support six frames alignment between a DNA sequence and
> > 	a protein sequence.
> 
> Again, it's almost better to let the user do this themselves in bioperl
> space (keep your code and interface simpler):
> 
> for $frame (0 .. 2) {
>   push @alns,
>     $factory->pairwise_alignment($dna->translate(undef, undef, $frame), $prot);
> }
> 
> $dna = $dna->revcomp;
> # and repeat above for loop.
> 
> As a side note, you need to make sure your similarity matrix has scores
> for "B", "Z", "X" and "*", if it doesn't already.
> 
> -Aaron
>