[Biojava-l] GSoC project on MSA

Gustavo Akio Tominaga Sacomoto sacomoto at gmail.com
Tue Apr 6 18:53:04 UTC 2010


Hello Andreas,

On Tue, Apr 6, 2010 at 2:46 PM, Andreas Prlic <andreas at sdsc.edu> wrote:
> Hi Gustavo,
>
> With straightforward I meant that we only have 3 months for this project and
> we should not try to solve all problems at the same time. Probably a
> realistic approach is to start with trying to keep things modular and simple
> (think interfaces and implementations) and stick to standard solutions that
> have been shown to work elsewhere. If there is more time in the project one
> can then replace some of the implementations with technically more advanced
> ones.

I think my question wasn't very clear. My intention in this project is
to follow the approach (with the three steps) outlined on the project
page, using the classical progressive alignment heuristic: build the
distance matrix, build the guide tree and, following this tree,
progressively align more sequences together.
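
To make the second step concrete, here is a rough sketch of how a guide
tree could be built from the distance matrix. It assumes UPGMA (average
linkage) as the clustering method, which is only one common choice, and
the class and method names (GuideTreeSketch, upgma) are made up for
illustration, not part of BioJava. The tree is returned as a Newick
string, and the progressive step would then align along it from the
leaves to the root.

```java
import java.util.Arrays;

// Illustrative only: builds a guide tree topology with UPGMA (an assumption;
// other clustering methods such as neighbor joining would also work).
final class GuideTreeSketch {

    // dist is a symmetric matrix of pairwise distances, names the sequence labels.
    static String upgma(double[][] dist, String[] names) {
        int n = names.length;
        if (n == 0) throw new IllegalArgumentException("no sequences");

        // Work on copies so the caller's data is left untouched.
        double[][] d = new double[n][];
        for (int i = 0; i < n; i++) d[i] = dist[i].clone();
        String[] label = names.clone();
        int[] size = new int[n];
        boolean[] active = new boolean[n];
        Arrays.fill(size, 1);
        Arrays.fill(active, true);

        int last = 0;
        for (int step = 0; step < n - 1; step++) {
            // Find the closest pair of clusters that are still active.
            int bi = -1, bj = -1;
            for (int i = 0; i < n; i++) {
                if (!active[i]) continue;
                for (int j = i + 1; j < n; j++) {
                    if (!active[j]) continue;
                    if (bi < 0 || d[i][j] < d[bi][bj]) { bi = i; bj = j; }
                }
            }
            // Merge cluster bj into bi, using the size-weighted average distance.
            for (int k = 0; k < n; k++) {
                if (!active[k] || k == bi || k == bj) continue;
                double avg = (size[bi] * d[bi][k] + size[bj] * d[bj][k])
                        / (size[bi] + size[bj]);
                d[bi][k] = avg;
                d[k][bi] = avg;
            }
            label[bi] = "(" + label[bi] + "," + label[bj] + ")";
            size[bi] += size[bj];
            active[bj] = false;
            last = bi;
        }
        return label[last] + ";";
    }

    public static void main(String[] args) {
        double[][] dist = {
            {0.0, 0.1, 0.8},
            {0.1, 0.0, 0.9},
            {0.8, 0.9, 0.0}
        };
        System.out.println(upgma(dist, new String[] {"A", "B", "C"}));
    }
}
```

With the toy matrix in main, A and B are joined first because they are
the closest pair, and C is added last.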

What I propose for the third step is a first implementation using the
(simpler) dynamic programming described in the first CLUSTAL paper
(I think it's from 1988), then incrementally improving the algorithm to
get closer to the one described in the CLUSTALW paper (from 1994). Is
this more or less what you had in mind?
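
As a baseline for that dynamic programming, the textbook global
alignment recurrence (Needleman-Wunsch) with a linear gap penalty could
look like the sketch below. This is not the exact procedure from either
CLUSTAL paper; the class name and the scoring values are placeholders I
picked for illustration, and a real version would use a substitution
matrix and keep a traceback to recover the alignment itself.

```java
// Illustrative only: global alignment score with a linear gap penalty.
final class GlobalAlignmentSketch {

    // Returns the optimal global alignment score of a and b.
    // match, mismatch and gap are per-position scores (gap is usually negative).
    static int score(String a, String b, int match, int mismatch, int gap) {
        int[][] s = new int[a.length() + 1][b.length() + 1];

        // Aligning a prefix against the empty string costs one gap per residue.
        for (int i = 1; i <= a.length(); i++) s[i][0] = i * gap;
        for (int j = 1; j <= b.length(); j++) s[0][j] = j * gap;

        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int sub = (a.charAt(i - 1) == b.charAt(j - 1)) ? match : mismatch;
                int diag = s[i - 1][j - 1] + sub; // align a[i-1] with b[j-1]
                int up = s[i - 1][j] + gap;       // a[i-1] against a gap
                int left = s[i][j - 1] + gap;     // b[j-1] against a gap
                s[i][j] = Math.max(diag, Math.max(up, left));
            }
        }
        return s[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(score("ACGT", "AGT", 1, -1, -2));
    }
}
```

The same recurrence generalizes to aligning two groups of already
aligned sequences, by scoring columns against columns instead of
residues against residues.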

> Since we are doing things in Java I am interested in having support for
> parallelisation wherever possible. Another issue is how to verify that the
> created alignments are meaningful. One could e.g. use the biojava structure
> modules to calculate protein structure alignments to verify the quality of
> the obtained multiple sequence alignments.

About parallel strategies, I think a relatively easy place to use them
is the distance matrix construction: we could have several threads
calculating the pairwise alignments for different pairs of sequences in
the set.
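
Since the pairs are independent, something along these lines should
work with a plain thread pool. It is only a sketch: the class name is
invented, and the distance function is a stand-in (fraction of differing
positions, with no alignment at all) where the real pairwise alignment
would be called.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative only: fills a symmetric distance matrix with one task per pair.
final class ParallelDistanceMatrixSketch {

    // Placeholder distance: 1 - (identical positions / longer length).
    // A real implementation would derive this from a pairwise alignment.
    static double distance(String a, String b) {
        int shorter = Math.min(a.length(), b.length());
        int longer = Math.max(a.length(), b.length());
        if (longer == 0) return 0.0;
        int same = 0;
        for (int i = 0; i < shorter; i++) {
            if (a.charAt(i) == b.charAt(i)) same++;
        }
        return 1.0 - (double) same / longer;
    }

    static double[][] compute(List<String> seqs, int threads) {
        int n = seqs.size();
        double[][] d = new double[n][n];
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                for (int j = i + 1; j < n; j++) {
                    final int a = i, b = j;
                    // Each task writes only its own two cells, so no locking is needed.
                    pending.add(pool.submit(() -> {
                        double v = distance(seqs.get(a), seqs.get(b));
                        d[a][b] = v;
                        d[b][a] = v;
                    }));
                }
            }
            // Waiting on every future also makes the writes visible to this thread.
            for (Future<?> f : pending) f.get();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("distance computation failed", e);
        } finally {
            pool.shutdown();
        }
        return d;
    }

    public static void main(String[] args) {
        double[][] d = compute(List.of("ACGT", "ACGA", "TTTT"), 2);
        System.out.println(d[0][1] + " " + d[0][2]);
    }
}
```

The number of threads is left as a parameter so the user can match it to
the number of available cores.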

Now, measuring alignment quality is a tougher issue. The CLUSTALW
paper doesn't give any way to measure the quality of the result; they
consider a good alignment to be one that is hard to improve by eye
(but they claim that for sufficiently similar sequences, with no pair
less than 35% identical, the results are good). Can I do the same as
in the CLUSTALW paper and leave the quality measure to the user? How
concerned should I be with that in this project?

> All applications have to be made via Google. We are providing comments on
> drafts of proposals and try to work together with applicants to improve the
> submissions. Note: The application deadline is soon and speed is important
> now.

I will try to send a proposal draft to this mailing list by tomorrow
to get some feedback from you.

> Andreas
>
>
>
> On Mon, Apr 5, 2010 at 10:29 PM, Gustavo Akio Tominaga Sacomoto
> <sacomoto at gmail.com> wrote:
>>
>> Hello,
>>
>> I'm currently a graduate student at University of São Paulo (Brazil)
>> and I'm quite interested in applying for the all-Java MSA project. I'm
>> already familiar with the multiple sequence alignment problem, I
>> developed a lossless filter for this problem as my undergraduate final
>> project, the work is described here
>> [http://www.almob.org/content/4/1/3] and there is an online version of
>> the algorithm here
>> [http://mobyle.genouest.org/cgi-bin/Mobyle/portal.py?form=tuiuiu].
>>
>> Now, regarding the project, just to make it clear, when you say in the
>> "straightforward approach for building up the MSA progressively", you
>> mean the standard dynamic programming approach for pairwise alignment
>> following the guide tree built in the second step, right?
>>
>> One last question, should I send my proposal direct to the Google's
>> web app or here first?
>>
>> Thanks,
>>
>> Gustavo Sacomoto
>>
>> _______________________________________________
>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>
>
>
> --
> -----------------------------------------------------------------------
> Dr. Andreas Prlic
> Senior Scientist, RCSB PDB Protein Data Bank
> University of California, San Diego
> (+1) 858.246.0526
> -----------------------------------------------------------------------
>

Thanks for your help.

gustavo



