[Biojava-l] GSoC project on MSA

Tue Apr 6 21:27:15 UTC 2010

Hi Gustavo,

In principle I agree with all of it; see details below:

> I think my question wasn't very clear. My intention in this project is
> to follow the approach (with the three steps) outlined in the project's
> page, using the classical progressive alignment heuristic: build the
> distance matrix, build the guide tree, and use this tree to
> progressively align more sequences together.
>

yes
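
For illustration, the step from distance matrix to guide tree could be sketched roughly as below. This is only a sketch under assumptions: it uses UPGMA clustering (CLUSTALW itself uses neighbor-joining), and the class name GuideTreeSketch and method upgma are made up for this example, not BioJava API.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: builds a UPGMA guide tree and returns it as a Newick-like string. */
class GuideTreeSketch {

    /** dist must be a symmetric matrix; names[i] labels row/column i. Hypothetical helper. */
    static String upgma(double[][] dist, String[] names) {
        int n = names.length;
        if (n == 0) throw new IllegalArgumentException("need at least one sequence");
        List<String> labels = new ArrayList<>();
        List<Integer> sizes = new ArrayList<>();
        List<List<Double>> d = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            labels.add(names[i]);
            sizes.add(1);
            List<Double> row = new ArrayList<>();
            for (int j = 0; j < n; j++) row.add(dist[i][j]);
            d.add(row);
        }
        while (labels.size() > 1) {
            // Find the closest pair of clusters (a < b).
            int a = 0, b = 1;
            double best = Double.POSITIVE_INFINITY;
            for (int i = 0; i < labels.size(); i++) {
                for (int j = i + 1; j < labels.size(); j++) {
                    if (d.get(i).get(j) < best) {
                        best = d.get(i).get(j);
                        a = i;
                        b = j;
                    }
                }
            }
            // Distance from the merged cluster to every other cluster:
            // size-weighted average of the two old distances.
            int sa = sizes.get(a), sb = sizes.get(b);
            List<Double> merged = new ArrayList<>();
            for (int k = 0; k < labels.size(); k++) {
                if (k == a || k == b) continue;
                merged.add((sa * d.get(a).get(k) + sb * d.get(b).get(k)) / (sa + sb));
            }
            String label = "(" + labels.get(a) + "," + labels.get(b) + ")";
            // Remove the two old clusters; b first so index a stays valid.
            labels.remove(b);
            labels.remove(a);
            sizes.remove(b);
            sizes.remove(a);
            d.remove(b);
            d.remove(a);
            for (List<Double> row : d) {
                row.remove(b);
                row.remove(a);
            }
            // Append the merged cluster as the last row and column.
            for (int k = 0; k < d.size(); k++) d.get(k).add(merged.get(k));
            merged.add(0.0);
            d.add(merged);
            labels.add(label);
            sizes.add(sa + sb);
        }
        return labels.get(0) + ";";
    }

    public static void main(String[] args) {
        double[][] dist = {
            {0, 2, 6, 10},
            {2, 0, 6, 10},
            {6, 6, 0, 10},
            {10, 10, 10, 0}
        };
        System.out.println(upgma(dist, new String[] {"A", "B", "C", "D"}));
    }
}
```

The nesting of the resulting tree gives the order in which sequences (or groups of sequences) would be aligned in the third step: the innermost pair first, then outwards.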

>
> What I propose for the third step is a first implementation using the
> (simpler) dynamic programming described in the first CLUSTAL paper
> (I think it's from 1988) and incrementally improving the algorithm to
> get closer to the one described in the CLUSTALW paper (from 1994). Is
> this more or less what you had in mind?
>

yes, sounds good.
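
As a starting point for the dynamic programming, something like the following could serve as the simplest baseline. Note the assumptions: this is a plain Needleman-Wunsch-style global alignment score with a linear gap penalty, not the exact procedure from either CLUSTAL paper, and the class name, method name, and scoring parameters are placeholders chosen for this sketch.

```java
/** Sketch only: global pairwise alignment score with a linear gap penalty. */
class GlobalAlignmentSketch {

    /**
     * Returns the best global alignment score of a and b.
     * match/mismatch/gap are illustrative parameters (gap should be negative);
     * a real implementation would use a substitution matrix and affine gaps.
     */
    static int score(String a, String b, int match, int mismatch, int gap) {
        int[][] dp = new int[a.length() + 1][b.length() + 1];
        // Aligning a prefix against the empty string costs one gap per residue.
        for (int i = 1; i <= a.length(); i++) dp[i][0] = i * gap;
        for (int j = 1; j <= b.length(); j++) dp[0][j] = j * gap;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int sub = a.charAt(i - 1) == b.charAt(j - 1) ? match : mismatch;
                int diagonal = dp[i - 1][j - 1] + sub; // align a[i-1] with b[j-1]
                int up = dp[i - 1][j] + gap;           // gap in b
                int left = dp[i][j - 1] + gap;         // gap in a
                dp[i][j] = Math.max(diagonal, Math.max(up, left));
            }
        }
        return dp[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(score("ACGT", "AGT", 1, -1, -2));
    }
}
```

Moving from this towards CLUSTALW would then mean replacing the fixed scores with substitution matrices, adding affine and position-specific gap penalties, and aligning profiles instead of single sequences.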

>
> About parallel strategies, I think a relatively easy way we could use
> them is in the distance matrix construction: we could have several
> threads calculating the pairwise alignments for different pairs of
> sequences in the set.
>

Correct. A first implementation would probably target a single machine
with multiple CPUs. More advanced implementations could add support for,
e.g., Map/Reduce, JPPF, or something similar.
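
A minimal single-machine sketch of this idea, using a fixed thread pool from java.util.concurrent, might look like the code below. The assumptions are that every pair can be computed independently, and that the distance method is only a cheap stand-in (fraction of differing positions) for a real pairwise alignment; all class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Sketch only: fills a symmetric distance matrix with one task per sequence pair. */
class ParallelDistanceMatrixSketch {

    /**
     * Hypothetical stand-in for a pairwise-alignment distance: the fraction of
     * positions that differ, counting the length difference as mismatches.
     */
    static double distance(String a, String b) {
        int longer = Math.max(a.length(), b.length());
        if (longer == 0) return 0.0;
        int shorter = Math.min(a.length(), b.length());
        int diff = longer - shorter;
        for (int i = 0; i < shorter; i++) {
            if (a.charAt(i) != b.charAt(i)) diff++;
        }
        return (double) diff / longer;
    }

    static double[][] distanceMatrix(List<String> seqs, int threads) {
        int n = seqs.size();
        double[][] m = new double[n][n];
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                for (int j = i + 1; j < n; j++) {
                    final int r = i, c = j;
                    // Each task writes only its own two cells, so no locking is needed.
                    pending.add(pool.submit(() -> {
                        double d = distance(seqs.get(r), seqs.get(c));
                        m[r][c] = d;
                        m[c][r] = d;
                    }));
                }
            }
            // Wait for every pair; get() also makes the writes visible to this thread.
            for (Future<?> f : pending) f.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while computing distances", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a pairwise task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
        return m;
    }

    public static void main(String[] args) {
        List<String> seqs = new ArrayList<>();
        seqs.add("ACGT");
        seqs.add("ACGA");
        seqs.add("TTTT");
        double[][] m = distanceMatrix(seqs, Runtime.getRuntime().availableProcessors());
        System.out.println(m[0][1] + " " + m[0][2] + " " + m[1][2]);
    }
}
```

One task per pair keeps the sketch simple; with many short sequences it would likely be better to hand each thread a whole row or a block of pairs to reduce scheduling overhead.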

> Now, the alignment quality measure is a tougher issue. The CLUSTALW
> paper doesn't give any way to measure the quality of the result; they
> consider a good alignment to be one that is hard to improve by eye (but
> they claim that for sufficiently similar sequences, with no pair less
> than 35% identical, the results are good). Can I do the same as in the
> CLUSTALW paper and leave the quality measure to the user? How concerned
> should I be with that in this project?
>

Getting an overall core algorithm that works should be the priority. The
benchmarking part is not mandatory, but it is something to keep in mind.
I have plenty of material for that once we get to that stage.

> I will try to send a proposal draft to this mailing list by tomorrow
> to get some feedback from you.
>

Excellent, looking forward to it.

Andreas

-- 
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------