[Biojava-l] GSoC project on MSA

Thu Apr 8 17:36:56 UTC 2010

Looks pretty good.

One issue during the progressive alignment build up: 3D structure
alignments can increase the reliability of the sequence alignments,
particularly if the sequences are only distantly related. Having a way
to incorporate the 3D structure info would be nice...

Andreas

On Thu, Apr 8, 2010 at 9:26 AM, Gustavo Akio Tominaga Sacomoto
<sacomoto at gmail.com> wrote:
> Hi Andreas,
>
> On Wed, Apr 7, 2010 at 4:12 PM, Andreas Prlic <andreas at sdsc.edu> wrote:
>> Hi Gustavo,
>>
>> here my 0.02$:
>>
>> * For some of your steps there is already code available in BioJava.
>> MIght be good to take a look at what is already there...   (look at
>> the alignment and phylo modules for dynamic programming and
>> Neighbour-Joining)
>>
>> * What about risks? Where do you expect difficulties and how to work
>> around them?
>>
>> * Step 4: Can you add more details? How do you plan to approach this?
>> E.g. Clustalw has a number of rules implemented at this stage. Do you
>> plan to support multiple rules as well and how to do this technically.
>> Something nice would be the possibility to use structure alignments to
>> guide the sequence alignments. (structure module)
>
> Based on it I rewrote the step 4 and add a "Main Risks" section.
>
> I pasted just the new version of step 4 and the new section at the end
> of this e-mal.
>
> Thank you very much for your feedback.
>
> gustavo
>
>
>
> -------------------------------------------------------------------------------------------
>
> ** 4. Implement the algorithm for progressive MSA and the MSA wrapper.
>
>  A progressive MSA is a heuristic approach for the MSA problem, at
> each step a pairwise alignment between two sequences, a sequence and
> an alignment or between two alignments is done. So, the multiple
> alignment is built incrementally, at each iteration more sequences are
> aligned together. The guide tree gives an order for this incremental
> alignment, in a bottom-up (in the tree) fashion sequences (or groups
> of sequences) with greater similarity are aligned first. Therefore, in
> order to have a more flexible and reusable code, the code design will
> allow any binary tree of the sequences to be used as a guide tree, not
> only the one built in the last step. This will allow a priori
> phylogenetic or tertiary similarity (structural similarity) knowledge
> be used to guide the multiple alignment order.
>
>  This is certainly the most difficult part of the project, so to make
> sure we are going to deliver a fully functional MSA algorithm, a safer
> approach is going to be taken. In the first place, a a basic algorithm
> described in [2] will be implemented. Once this get successfully done
> and the code fully integrated to the Biojava code base, the features
> described in [1] are going to be incrementally added (and tested) in
> order to implement the full algorithm. This step is further divided in
> substeps.
>
> *** 4.1 Implement a first simpler dynamic programming (DP) algorithm.
>
>  This is the generalized pairwise alignment used in each iteration of
> the progressive MSA. Gaps  already presents in one of the alignments
> (profiles) remain fixed, gap opening penalties remain unchanged, this
> means that opening new gaps inside existent gaps will be fully
> penalized. The code for this algorithm is similar to, the already
> present in Biojava, code for regular pairwise alignment.
>
> *** 4.2 Implement the basic progressive MSA algorithm.
>
>  In this substep is going to be implemented the incremental algorithm
> to built the MSA, transversing a guide tree (parameter, could be the
> one built in step 3 or any other one) in a bottom-up fashion and using
> the algorithm from substep 4.1 at each iteration.
>
> *** 4.3 Implement the MSA wrapper.
>
>  The MSA wrapper is going to be a method that wraps steps 2, 3 and
> 4.2, giving a simple method (for the final user) to calculate the MSA.
> Receiving as parameters the set of sequences to be aligned, the gap
> opening penalty, gap extend penalty and residue matrix. Returning the
> MSA for the sequence set.
>  At the end of this substep, we get a basic fully functional MSA
> algorithm, using the progressive heuristic.
>
> *** 4.4 Implement gaps penalties rescaling and parameter default values.
>
>  Gap penalties to open a new gap an extend a existing one (the affine
> gap weight model) are user defined parameters. This substep will
> define default values, based on the residue matrix, for this
> parameters and implement global rescaling rules (based on sequences
> sizes) for this parameters.
>
> *** 4.5 Enhance the DP algorithm to use different sequences weight.
>
>  Based on the guide tree, for each sequence a different weight
> (divergent sequences receive high values) is calculated and used in
> the scoring scheme of the generalized DP algorithm.
>
> *** 4.6 Enhance the DP algorithm to use position based gap penalties.
>
>  The DP algorithm from substep 4.1 uses globally defined gap opening
> penalty. In this substep, the algorithm is going to be modified do use
> position based penalty, this is simple, once is known an array of
> opening penalties for each sequence position. This array is calculated
> based on several hierarchical (only apply the first one that fits, if
> any) rules, those are rescaling rules and the array is initialized
> with the original gap opening penalty.
>
> Given the hierarchical nature of the rules, they can be implemented in
> a incremental way, from the highest priority rule to the lowest, the
> algorithm of each step being a refinement of the previous one. I am
> omitting the detailed description of each rule. However, to verify if
> a given rule apply to a given position, all that is necessary is to
> check at most 16 adjacent positions and the same position in the other
> already aligned sequences.
>
> At the end of each of the following steps we a have functional
> algorithm, and after 4.6.4 the full CLUSTALW algorithm is complete.
>
> **** 4.6.1 Lowered gap opening penalties at existing gaps.
> **** 4.6.2 Increased gap opening penalties near existing gaps.
> **** 4.6.3 Reduced gap opening penalties in hydrophilic stretches.
> **** 4.6.4 Residue specific gap penalties.
>
>  ETA: (basic algorithm) 3 weeks. (full algorithm) 7 weeks.
>
>  EXTRA: Implement some benchmark technique to measure the final
> alignment quality.
>
> Main Risks
> ----------
>
> The main risk to this project is the intrinsic complexity of the MSA
> progressive algorithm. To deal with that we decided to break the
> implementation in a large number of small and manageable steps, and
> the steps are designed in a way that, at the end of each of them, we
> will have a complete and testable new function (or a modification of
> an existing one). Besides that, to be extra careful the project aims
> to produce a simple full functional MSA algorithm as early as
> possible, the estimated time is 8 weeks, this way we guarantee to
> deliver at a simpler, but working and bug-free, version.
>
>
>
>
>> Andreas
>>
>>
>>> -------------------------------------------------------------
>>>
>>> GSoC proposal
>>>
>>> Abstract
>>> --------
>>>
>>> This project aims to develop an all-Java implementation of a multiple
>>> sequence alignment (MSA) algorithm to be added to the Biojava toolkit,
>>> using the progressive algorithm described in the CLUSTALW paper [1].
>>>
>>> The Importance
>>> --------------
>>>
>>> Multiple sequence alignment is a frequently performed task in sequence
>>> analysis with the goal to identify new members of protein families and
>>> infer phylogenetic relationships between proteins and genes. At the
>>> present there is no Java-only implementation for this algorithm. As
>>> such the number of already existing and Java related BioInformatics
>>> tools and web sites would benefit from this implementation and
>>> sequence analysis could be more easily performed by the end-user.
>>>
>>> About Me
>>> --------
>>>
>>> I am a graduate student at University of São Paulo (Brazil), I got my
>>> undergraduate degree from the same university with a major in Computer
>>> Science and a minor in Biology. I have been involved with
>>> Bioinformatics for 5 years, always with sequence analysis with
>>> particular interest in the MSA problem. Also, in my undergraduate
>>> final project I developed a lossless filter (pruning algorithm) for
>>> the MSA problem, the work is published in [3] and there is an online
>>> implementation of the algorithm in [4]. Finally, I have experience
>>> with the C, C++, Java, Python and Ruby programming languages; Git and
>>> SVN version control systems.
>>>
>>> Project Plan
>>> ------------
>>>
>>> The project is divided in four main steps, at the end of each step a
>>> completely functional and bug-free new algorithm will be added to the
>>> Biojava code base. It should be noticed that each step has a strong
>>> dependence on the previous one, so before move to the next step a
>>> careful testing will be done.
>>>
>>> The four steps are described below, estimated times for accomplishment
>>> of each step are also given and in some steps extra enhancements are
>>> described, they will be implemented if there is some time remaining
>>> after all steps are completed.
>>>
>>> ** 1. Study the Biojava pairwise alignment code and update it to be
>>> compliant with Biojava 3.
>>>
>>>  The pairwise alignment will play an important role in the MSA
>>> algorithm. This step is also important for me to get used to the
>>> Biojava coding standards and get in touch with the Biojava dev
>>> community.
>>>
>>>  ETA: 2 weeks.
>>>
>>> ** 2. Implement the algorithm to build the distance matrix.
>>>
>>>  This is done using the pairwise alignment for each pair of sequence
>>> in the set to be aligned.
>>>
>>>  ETA: 1 week.
>>>
>>>  EXTRA: Enhance the basic algorithm to use parallel strategies, use
>>> several threads to calculate the pairwise alignment for different
>>> pairs in the sequence set.
>>>
>>> ** 3. Implement the algorithm to build the guide tree.
>>>
>>>  The guide tree is based on the distance matrix built in the last
>>> step, the tree construction strategy adopted will be the Neighbor
>>> Joining Algorithm.
>>>
>>>  ETA: 2 weeks.
>>>
>>> ** 4. Implement the algorithm for progressive MSA using the guide tree.
>>>
>>>  This is certainly the most difficult part of the project, so to make
>>> sure we are going to deliver a fully functional MSA algorithm, a safer
>>> approach is going to be taken. In the first place, a dynamic
>>> programming algorithm described in [2] will be implemented. Once this
>>> get successfully done and the code fully integrated to the Biojava
>>> code base, the features described in [1] are going to be incrementally
>>> added (and tested) in order to implement the full dynamic programming
>>> algorithm.
>>>
>>>  ETA: (basic algorithm) 3 weeks. (full algorithm) 7 weeks.
>>>
>>>  EXTRA: Implement some benchmark technique to measure the final
>>> alignment quality.
>>>
>>> References
>>> ----------
>>>
>>> [1] http://www.ncbi.nlm.nih.gov/pubmed/7984417
>>> [2] http://www.ncbi.nlm.nih.gov/pubmed/3243435
>>> [3] http://www.almob.org/content/4/1/3
>>> [4] http://mobyle.genouest.org/cgi-bin/Mobyle/portal.py?form=tuiuiu
>>>
>>>
>>>
>>> On Tue, Apr 6, 2010 at 6:27 PM, Andreas Prlic <andreas at sdsc.edu> wrote:
>>>> Hi Gustavo,
>>>>
>>>> In principle I agree to all, see details below:
>>>>
>>>>
>>>> I think my question wasn't very clear, my intention in this project is
>>>>>
>>>>> to follow the approach (with the tree steps) outlined in the project's
>>>>> page. Using the classical progressive alignment heuristic: build the
>>>>> distance matrix, build the guide tree and using this tree
>>>>> progressively align more sequences together.
>>>>
>>>> yes
>>>>
>>>>>
>>>>> What I propose for the third step is a first implementation using the
>>>>> (more simple) dynamic programming described in the first CLUSTAL paper
>>>>> (I thinks it's from 1988) and incrementally improving the algorithm to
>>>>> get closer to the one described in CLUSTALW paper (from 1994). Is this
>>>>> more or less what you had in mind?
>>>>
>>>> yes, sounds good.
>>>>
>>>>>
>>>>> About parallel strategies, I think a relative easy way we could use it
>>>>> is in the distance matrix construction, we could have several threads
>>>>> calculating the pairwise alignment for different pairs of sequence in
>>>>> the set.
>>>>
>>>> Correct. Probably a first implementation would be for a single machine/
>>>> multi CPU. More advanced implementations could provide support e.g. for
>>>> Map/Reduce, JPPF, or something like that...
>>>>
>>>>> Now, the alignment quality measures is a tougher issue. The CLUSTALW
>>>>> paper doesn't give any way to measure the quality of the result, they
>>>>> consider a good alignment the one that is hard to improve by eye (But
>>>>> they claim that for sequences sufficient similar, no pair less than
>>>>> 35% identical, the results are good). Can I do the same as in CLUSTALW
>>>>> paper and leave the quality measure to the user? How concerned should
>>>>> I be with that in this project?
>>>>
>>>> Getting an overall core-algorithm that works should be priority. The
>>>> benchmarking part is not mandatory, but something to keep in mind... I have
>>>> plenty of material for that, once we get to that stage...
>>>>
>>>>> I will try send to this mailing list a proposal draft until tomorrow
>>>>> to have some feedback from you.
>>>>
>>>> Excellent, looking forward to it.
>>>>
>>>> Andreas
>>>>
>>>> --
>>>> -----------------------------------------------------------------------
>>>> Dr. Andreas Prlic
>>>> Senior Scientist, RCSB PDB Protein Data Bank
>>>> University of California, San Diego
>>>> (+1) 858.246.0526
>>>> -----------------------------------------------------------------------
>>>>
>>>
>>
>>
>>
>> --
>> -----------------------------------------------------------------------
>> Dr. Andreas Prlic
>> Senior Scientist, RCSB PDB Protein Data Bank
>> University of California, San Diego
>> (+1) 858.246.0526
>> -----------------------------------------------------------------------
>>
>

-- 
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------