[Biojava-l] GSoC project on MSA

Wed Apr 7 05:29:31 UTC 2010

Hi Andreas,

My proposal is pasted at the end of this e-mail.

I'm waiting for your feedback.

Thanks,

gustavo

-------------------------------------------------------------

GSoC proposal

Abstract
--------

This project aims to develop an all-Java implementation of a multiple
sequence alignment (MSA) algorithm to be added to the Biojava toolkit,
using the progressive algorithm described in the CLUSTALW paper [1].

The Importance
--------------

Multiple sequence alignment is a frequently performed task in sequence
analysis with the goal to identify new members of protein families and
infer phylogenetic relationships between proteins and genes. At the
present there is no Java-only implementation for this algorithm. As
such the number of already existing and Java related BioInformatics
tools and web sites would benefit from this implementation and
sequence analysis could be more easily performed by the end-user.

About Me
--------

I am a graduate student at University of São Paulo (Brazil), I got my
undergraduate degree from the same university with a major in Computer
Science and a minor in Biology. I have been involved with
Bioinformatics for 5 years, always with sequence analysis with
particular interest in the MSA problem. Also, in my undergraduate
final project I developed a lossless filter (pruning algorithm) for
the MSA problem, the work is published in [3] and there is an online
implementation of the algorithm in [4]. Finally, I have experience
with the C, C++, Java, Python and Ruby programming languages; Git and
SVN version control systems.

Project Plan
------------

The project is divided in four main steps, at the end of each step a
completely functional and bug-free new algorithm will be added to the
Biojava code base. It should be noticed that each step has a strong
dependence on the previous one, so before move to the next step a
careful testing will be done.

The four steps are described below, estimated times for accomplishment
of each step are also given and in some steps extra enhancements are
described, they will be implemented if there is some time remaining
after all steps are completed.

** 1. Study the Biojava pairwise alignment code and update it to be
compliant with Biojava 3.

 The pairwise alignment will play an important role in the MSA
algorithm. This step is also important for me to get used to the
Biojava coding standards and get in touch with the Biojava dev
community.

 ETA: 2 weeks.

** 2. Implement the algorithm to build the distance matrix.

 This is done using the pairwise alignment for each pair of sequence
in the set to be aligned.

 ETA: 1 week.

 EXTRA: Enhance the basic algorithm to use parallel strategies, use
several threads to calculate the pairwise alignment for different
pairs in the sequence set.

** 3. Implement the algorithm to build the guide tree.

 The guide tree is based on the distance matrix built in the last
step, the tree construction strategy adopted will be the Neighbor
Joining Algorithm.

 ETA: 2 weeks.

** 4. Implement the algorithm for progressive MSA using the guide tree.

 This is certainly the most difficult part of the project, so to make
sure we are going to deliver a fully functional MSA algorithm, a safer
approach is going to be taken. In the first place, a dynamic
programming algorithm described in [2] will be implemented. Once this
get successfully done and the code fully integrated to the Biojava
code base, the features described in [1] are going to be incrementally
added (and tested) in order to implement the full dynamic programming
algorithm.

 ETA: (basic algorithm) 3 weeks. (full algorithm) 7 weeks.

 EXTRA: Implement some benchmark technique to measure the final
alignment quality.

References
----------

[1] http://www.ncbi.nlm.nih.gov/pubmed/7984417
[2] http://www.ncbi.nlm.nih.gov/pubmed/3243435
[3] http://www.almob.org/content/4/1/3
[4] http://mobyle.genouest.org/cgi-bin/Mobyle/portal.py?form=tuiuiu

On Tue, Apr 6, 2010 at 6:27 PM, Andreas Prlic <andreas at sdsc.edu> wrote:
> Hi Gustavo,
>
> In principle I agree to all, see details below:
>
>
> I think my question wasn't very clear, my intention in this project is
>>
>> to follow the approach (with the tree steps) outlined in the project's
>> page. Using the classical progressive alignment heuristic: build the
>> distance matrix, build the guide tree and using this tree
>> progressively align more sequences together.
>
> yes
>
>>
>> What I propose for the third step is a first implementation using the
>> (more simple) dynamic programming described in the first CLUSTAL paper
>> (I thinks it's from 1988) and incrementally improving the algorithm to
>> get closer to the one described in CLUSTALW paper (from 1994). Is this
>> more or less what you had in mind?
>
> yes, sounds good.
>
>>
>> About parallel strategies, I think a relative easy way we could use it
>> is in the distance matrix construction, we could have several threads
>> calculating the pairwise alignment for different pairs of sequence in
>> the set.
>
> Correct. Probably a first implementation would be for a single machine/
> multi CPU. More advanced implementations could provide support e.g. for
> Map/Reduce, JPPF, or something like that...
>
>> Now, the alignment quality measures is a tougher issue. The CLUSTALW
>> paper doesn't give any way to measure the quality of the result, they
>> consider a good alignment the one that is hard to improve by eye (But
>> they claim that for sequences sufficient similar, no pair less than
>> 35% identical, the results are good). Can I do the same as in CLUSTALW
>> paper and leave the quality measure to the user? How concerned should
>> I be with that in this project?
>
> Getting an overall core-algorithm that works should be priority. The
> benchmarking part is not mandatory, but something to keep in mind... I have
> plenty of material for that, once we get to that stage...
>
>> I will try send to this mailing list a proposal draft until tomorrow
>> to have some feedback from you.
>
> Excellent, looking forward to it.
>
> Andreas
>
> --
> -----------------------------------------------------------------------
> Dr. Andreas Prlic
> Senior Scientist, RCSB PDB Protein Data Bank
> University of California, San Diego
> (+1) 858.246.0526
> -----------------------------------------------------------------------
>