[Biojava-l] GSoC project on MSA

Andreas Prlic andreas at sdsc.edu
Wed Apr 7 19:12:27 UTC 2010


Hi Gustavo,

here my 0.02$:

* For some of your steps there is already code available in BioJava.
MIght be good to take a look at what is already there...   (look at
the alignment and phylo modules for dynamic programming and
Neighbour-Joining)

* What about risks? Where do you expect difficulties and how to work
around them?

* Step 4: Can you add more details? How do you plan to approach this?
E.g. Clustalw has a number of rules implemented at this stage. Do you
plan to support multiple rules as well and how to do this technically.
Something nice would be the possibility to use structure alignments to
guide the sequence alignments. (structure module)

Andreas


> -------------------------------------------------------------
>
> GSoC proposal
>
> Abstract
> --------
>
> This project aims to develop an all-Java implementation of a multiple
> sequence alignment (MSA) algorithm to be added to the Biojava toolkit,
> using the progressive algorithm described in the CLUSTALW paper [1].
>
> The Importance
> --------------
>
> Multiple sequence alignment is a frequently performed task in sequence
> analysis with the goal to identify new members of protein families and
> infer phylogenetic relationships between proteins and genes. At the
> present there is no Java-only implementation for this algorithm. As
> such the number of already existing and Java related BioInformatics
> tools and web sites would benefit from this implementation and
> sequence analysis could be more easily performed by the end-user.
>
> About Me
> --------
>
> I am a graduate student at University of São Paulo (Brazil), I got my
> undergraduate degree from the same university with a major in Computer
> Science and a minor in Biology. I have been involved with
> Bioinformatics for 5 years, always with sequence analysis with
> particular interest in the MSA problem. Also, in my undergraduate
> final project I developed a lossless filter (pruning algorithm) for
> the MSA problem, the work is published in [3] and there is an online
> implementation of the algorithm in [4]. Finally, I have experience
> with the C, C++, Java, Python and Ruby programming languages; Git and
> SVN version control systems.
>
> Project Plan
> ------------
>
> The project is divided in four main steps, at the end of each step a
> completely functional and bug-free new algorithm will be added to the
> Biojava code base. It should be noticed that each step has a strong
> dependence on the previous one, so before move to the next step a
> careful testing will be done.
>
> The four steps are described below, estimated times for accomplishment
> of each step are also given and in some steps extra enhancements are
> described, they will be implemented if there is some time remaining
> after all steps are completed.
>
> ** 1. Study the Biojava pairwise alignment code and update it to be
> compliant with Biojava 3.
>
>  The pairwise alignment will play an important role in the MSA
> algorithm. This step is also important for me to get used to the
> Biojava coding standards and get in touch with the Biojava dev
> community.
>
>  ETA: 2 weeks.
>
> ** 2. Implement the algorithm to build the distance matrix.
>
>  This is done using the pairwise alignment for each pair of sequence
> in the set to be aligned.
>
>  ETA: 1 week.
>
>  EXTRA: Enhance the basic algorithm to use parallel strategies, use
> several threads to calculate the pairwise alignment for different
> pairs in the sequence set.
>
> ** 3. Implement the algorithm to build the guide tree.
>
>  The guide tree is based on the distance matrix built in the last
> step, the tree construction strategy adopted will be the Neighbor
> Joining Algorithm.
>
>  ETA: 2 weeks.
>
> ** 4. Implement the algorithm for progressive MSA using the guide tree.
>
>  This is certainly the most difficult part of the project, so to make
> sure we are going to deliver a fully functional MSA algorithm, a safer
> approach is going to be taken. In the first place, a dynamic
> programming algorithm described in [2] will be implemented. Once this
> get successfully done and the code fully integrated to the Biojava
> code base, the features described in [1] are going to be incrementally
> added (and tested) in order to implement the full dynamic programming
> algorithm.
>
>  ETA: (basic algorithm) 3 weeks. (full algorithm) 7 weeks.
>
>  EXTRA: Implement some benchmark technique to measure the final
> alignment quality.
>
> References
> ----------
>
> [1] http://www.ncbi.nlm.nih.gov/pubmed/7984417
> [2] http://www.ncbi.nlm.nih.gov/pubmed/3243435
> [3] http://www.almob.org/content/4/1/3
> [4] http://mobyle.genouest.org/cgi-bin/Mobyle/portal.py?form=tuiuiu
>
>
>
> On Tue, Apr 6, 2010 at 6:27 PM, Andreas Prlic <andreas at sdsc.edu> wrote:
>> Hi Gustavo,
>>
>> In principle I agree to all, see details below:
>>
>>
>> I think my question wasn't very clear, my intention in this project is
>>>
>>> to follow the approach (with the tree steps) outlined in the project's
>>> page. Using the classical progressive alignment heuristic: build the
>>> distance matrix, build the guide tree and using this tree
>>> progressively align more sequences together.
>>
>> yes
>>
>>>
>>> What I propose for the third step is a first implementation using the
>>> (more simple) dynamic programming described in the first CLUSTAL paper
>>> (I thinks it's from 1988) and incrementally improving the algorithm to
>>> get closer to the one described in CLUSTALW paper (from 1994). Is this
>>> more or less what you had in mind?
>>
>> yes, sounds good.
>>
>>>
>>> About parallel strategies, I think a relative easy way we could use it
>>> is in the distance matrix construction, we could have several threads
>>> calculating the pairwise alignment for different pairs of sequence in
>>> the set.
>>
>> Correct. Probably a first implementation would be for a single machine/
>> multi CPU. More advanced implementations could provide support e.g. for
>> Map/Reduce, JPPF, or something like that...
>>
>>> Now, the alignment quality measures is a tougher issue. The CLUSTALW
>>> paper doesn't give any way to measure the quality of the result, they
>>> consider a good alignment the one that is hard to improve by eye (But
>>> they claim that for sequences sufficient similar, no pair less than
>>> 35% identical, the results are good). Can I do the same as in CLUSTALW
>>> paper and leave the quality measure to the user? How concerned should
>>> I be with that in this project?
>>
>> Getting an overall core-algorithm that works should be priority. The
>> benchmarking part is not mandatory, but something to keep in mind... I have
>> plenty of material for that, once we get to that stage...
>>
>>> I will try send to this mailing list a proposal draft until tomorrow
>>> to have some feedback from you.
>>
>> Excellent, looking forward to it.
>>
>> Andreas
>>
>> --
>> -----------------------------------------------------------------------
>> Dr. Andreas Prlic
>> Senior Scientist, RCSB PDB Protein Data Bank
>> University of California, San Diego
>> (+1) 858.246.0526
>> -----------------------------------------------------------------------
>>
>



-- 
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------




More information about the Biojava-l mailing list