[Biojava-l] GSoC project on MSA

Thu Apr 8 16:26:55 UTC 2010

Hi Andreas,

On Wed, Apr 7, 2010 at 4:12 PM, Andreas Prlic <andreas at sdsc.edu> wrote:
> Hi Gustavo,
>
> here my 0.02$:
>
> * For some of your steps there is already code available in BioJava.
> MIght be good to take a look at what is already there...   (look at
> the alignment and phylo modules for dynamic programming and
> Neighbour-Joining)
>
> * What about risks? Where do you expect difficulties and how to work
> around them?
>
> * Step 4: Can you add more details? How do you plan to approach this?
> E.g. Clustalw has a number of rules implemented at this stage. Do you
> plan to support multiple rules as well and how to do this technically.
> Something nice would be the possibility to use structure alignments to
> guide the sequence alignments. (structure module)

Based on it I rewrote the step 4 and add a "Main Risks" section.

I pasted just the new version of step 4 and the new section at the end
of this e-mal.

Thank you very much for your feedback.

gustavo

-------------------------------------------------------------------------------------------

** 4. Implement the algorithm for progressive MSA and the MSA wrapper.

 A progressive MSA is a heuristic approach for the MSA problem, at
each step a pairwise alignment between two sequences, a sequence and
an alignment or between two alignments is done. So, the multiple
alignment is built incrementally, at each iteration more sequences are
aligned together. The guide tree gives an order for this incremental
alignment, in a bottom-up (in the tree) fashion sequences (or groups
of sequences) with greater similarity are aligned first. Therefore, in
order to have a more flexible and reusable code, the code design will
allow any binary tree of the sequences to be used as a guide tree, not
only the one built in the last step. This will allow a priori
phylogenetic or tertiary similarity (structural similarity) knowledge
be used to guide the multiple alignment order.

 This is certainly the most difficult part of the project, so to make
sure we are going to deliver a fully functional MSA algorithm, a safer
approach is going to be taken. In the first place, a a basic algorithm
described in [2] will be implemented. Once this get successfully done
and the code fully integrated to the Biojava code base, the features
described in [1] are going to be incrementally added (and tested) in
order to implement the full algorithm. This step is further divided in
substeps.

*** 4.1 Implement a first simpler dynamic programming (DP) algorithm.

  This is the generalized pairwise alignment used in each iteration of
the progressive MSA. Gaps  already presents in one of the alignments
(profiles) remain fixed, gap opening penalties remain unchanged, this
means that opening new gaps inside existent gaps will be fully
penalized. The code for this algorithm is similar to, the already
present in Biojava, code for regular pairwise alignment.

*** 4.2 Implement the basic progressive MSA algorithm.

  In this substep is going to be implemented the incremental algorithm
to built the MSA, transversing a guide tree (parameter, could be the
one built in step 3 or any other one) in a bottom-up fashion and using
the algorithm from substep 4.1 at each iteration.

*** 4.3 Implement the MSA wrapper.

  The MSA wrapper is going to be a method that wraps steps 2, 3 and
4.2, giving a simple method (for the final user) to calculate the MSA.
Receiving as parameters the set of sequences to be aligned, the gap
opening penalty, gap extend penalty and residue matrix. Returning the
MSA for the sequence set.
  At the end of this substep, we get a basic fully functional MSA
algorithm, using the progressive heuristic.

*** 4.4 Implement gaps penalties rescaling and parameter default values.

  Gap penalties to open a new gap an extend a existing one (the affine
gap weight model) are user defined parameters. This substep will
define default values, based on the residue matrix, for this
parameters and implement global rescaling rules (based on sequences
sizes) for this parameters.

*** 4.5 Enhance the DP algorithm to use different sequences weight.

  Based on the guide tree, for each sequence a different weight
(divergent sequences receive high values) is calculated and used in
the scoring scheme of the generalized DP algorithm.

*** 4.6 Enhance the DP algorithm to use position based gap penalties.

  The DP algorithm from substep 4.1 uses globally defined gap opening
penalty. In this substep, the algorithm is going to be modified do use
position based penalty, this is simple, once is known an array of
opening penalties for each sequence position. This array is calculated
based on several hierarchical (only apply the first one that fits, if
any) rules, those are rescaling rules and the array is initialized
with the original gap opening penalty.

Given the hierarchical nature of the rules, they can be implemented in
a incremental way, from the highest priority rule to the lowest, the
algorithm of each step being a refinement of the previous one. I am
omitting the detailed description of each rule. However, to verify if
a given rule apply to a given position, all that is necessary is to
check at most 16 adjacent positions and the same position in the other
already aligned sequences.

At the end of each of the following steps we a have functional
algorithm, and after 4.6.4 the full CLUSTALW algorithm is complete.

**** 4.6.1 Lowered gap opening penalties at existing gaps.
**** 4.6.2 Increased gap opening penalties near existing gaps.
**** 4.6.3 Reduced gap opening penalties in hydrophilic stretches.
**** 4.6.4 Residue specific gap penalties.

 ETA: (basic algorithm) 3 weeks. (full algorithm) 7 weeks.

 EXTRA: Implement some benchmark technique to measure the final
alignment quality.

Main Risks
----------

The main risk to this project is the intrinsic complexity of the MSA
progressive algorithm. To deal with that we decided to break the
implementation in a large number of small and manageable steps, and
the steps are designed in a way that, at the end of each of them, we
will have a complete and testable new function (or a modification of
an existing one). Besides that, to be extra careful the project aims
to produce a simple full functional MSA algorithm as early as
possible, the estimated time is 8 weeks, this way we guarantee to
deliver at a simpler, but working and bug-free, version.

> Andreas
>
>
>> -------------------------------------------------------------
>>
>> GSoC proposal
>>
>> Abstract
>> --------
>>
>> This project aims to develop an all-Java implementation of a multiple
>> sequence alignment (MSA) algorithm to be added to the Biojava toolkit,
>> using the progressive algorithm described in the CLUSTALW paper [1].
>>
>> The Importance
>> --------------
>>
>> Multiple sequence alignment is a frequently performed task in sequence
>> analysis with the goal to identify new members of protein families and
>> infer phylogenetic relationships between proteins and genes. At the
>> present there is no Java-only implementation for this algorithm. As
>> such the number of already existing and Java related BioInformatics
>> tools and web sites would benefit from this implementation and
>> sequence analysis could be more easily performed by the end-user.
>>
>> About Me
>> --------
>>
>> I am a graduate student at University of São Paulo (Brazil), I got my
>> undergraduate degree from the same university with a major in Computer
>> Science and a minor in Biology. I have been involved with
>> Bioinformatics for 5 years, always with sequence analysis with
>> particular interest in the MSA problem. Also, in my undergraduate
>> final project I developed a lossless filter (pruning algorithm) for
>> the MSA problem, the work is published in [3] and there is an online
>> implementation of the algorithm in [4]. Finally, I have experience
>> with the C, C++, Java, Python and Ruby programming languages; Git and
>> SVN version control systems.
>>
>> Project Plan
>> ------------
>>
>> The project is divided in four main steps, at the end of each step a
>> completely functional and bug-free new algorithm will be added to the
>> Biojava code base. It should be noticed that each step has a strong
>> dependence on the previous one, so before move to the next step a
>> careful testing will be done.
>>
>> The four steps are described below, estimated times for accomplishment
>> of each step are also given and in some steps extra enhancements are
>> described, they will be implemented if there is some time remaining
>> after all steps are completed.
>>
>> ** 1. Study the Biojava pairwise alignment code and update it to be
>> compliant with Biojava 3.
>>
>>  The pairwise alignment will play an important role in the MSA
>> algorithm. This step is also important for me to get used to the
>> Biojava coding standards and get in touch with the Biojava dev
>> community.
>>
>>  ETA: 2 weeks.
>>
>> ** 2. Implement the algorithm to build the distance matrix.
>>
>>  This is done using the pairwise alignment for each pair of sequence
>> in the set to be aligned.
>>
>>  ETA: 1 week.
>>
>>  EXTRA: Enhance the basic algorithm to use parallel strategies, use
>> several threads to calculate the pairwise alignment for different
>> pairs in the sequence set.
>>
>> ** 3. Implement the algorithm to build the guide tree.
>>
>>  The guide tree is based on the distance matrix built in the last
>> step, the tree construction strategy adopted will be the Neighbor
>> Joining Algorithm.
>>
>>  ETA: 2 weeks.
>>
>> ** 4. Implement the algorithm for progressive MSA using the guide tree.
>>
>>  This is certainly the most difficult part of the project, so to make
>> sure we are going to deliver a fully functional MSA algorithm, a safer
>> approach is going to be taken. In the first place, a dynamic
>> programming algorithm described in [2] will be implemented. Once this
>> get successfully done and the code fully integrated to the Biojava
>> code base, the features described in [1] are going to be incrementally
>> added (and tested) in order to implement the full dynamic programming
>> algorithm.
>>
>>  ETA: (basic algorithm) 3 weeks. (full algorithm) 7 weeks.
>>
>>  EXTRA: Implement some benchmark technique to measure the final
>> alignment quality.
>>
>> References
>> ----------
>>
>> [1] http://www.ncbi.nlm.nih.gov/pubmed/7984417
>> [2] http://www.ncbi.nlm.nih.gov/pubmed/3243435
>> [3] http://www.almob.org/content/4/1/3
>> [4] http://mobyle.genouest.org/cgi-bin/Mobyle/portal.py?form=tuiuiu
>>
>>
>>
>> On Tue, Apr 6, 2010 at 6:27 PM, Andreas Prlic <andreas at sdsc.edu> wrote:
>>> Hi Gustavo,
>>>
>>> In principle I agree to all, see details below:
>>>
>>>
>>> I think my question wasn't very clear, my intention in this project is
>>>>
>>>> to follow the approach (with the tree steps) outlined in the project's
>>>> page. Using the classical progressive alignment heuristic: build the
>>>> distance matrix, build the guide tree and using this tree
>>>> progressively align more sequences together.
>>>
>>> yes
>>>
>>>>
>>>> What I propose for the third step is a first implementation using the
>>>> (more simple) dynamic programming described in the first CLUSTAL paper
>>>> (I thinks it's from 1988) and incrementally improving the algorithm to
>>>> get closer to the one described in CLUSTALW paper (from 1994). Is this
>>>> more or less what you had in mind?
>>>
>>> yes, sounds good.
>>>
>>>>
>>>> About parallel strategies, I think a relative easy way we could use it
>>>> is in the distance matrix construction, we could have several threads
>>>> calculating the pairwise alignment for different pairs of sequence in
>>>> the set.
>>>
>>> Correct. Probably a first implementation would be for a single machine/
>>> multi CPU. More advanced implementations could provide support e.g. for
>>> Map/Reduce, JPPF, or something like that...
>>>
>>>> Now, the alignment quality measures is a tougher issue. The CLUSTALW
>>>> paper doesn't give any way to measure the quality of the result, they
>>>> consider a good alignment the one that is hard to improve by eye (But
>>>> they claim that for sequences sufficient similar, no pair less than
>>>> 35% identical, the results are good). Can I do the same as in CLUSTALW
>>>> paper and leave the quality measure to the user? How concerned should
>>>> I be with that in this project?
>>>
>>> Getting an overall core-algorithm that works should be priority. The
>>> benchmarking part is not mandatory, but something to keep in mind... I have
>>> plenty of material for that, once we get to that stage...
>>>
>>>> I will try send to this mailing list a proposal draft until tomorrow
>>>> to have some feedback from you.
>>>
>>> Excellent, looking forward to it.
>>>
>>> Andreas
>>>
>>> --
>>> -----------------------------------------------------------------------
>>> Dr. Andreas Prlic
>>> Senior Scientist, RCSB PDB Protein Data Bank
>>> University of California, San Diego
>>> (+1) 858.246.0526
>>> -----------------------------------------------------------------------
>>>
>>
>
>
>
> --
> -----------------------------------------------------------------------
> Dr. Andreas Prlic
> Senior Scientist, RCSB PDB Protein Data Bank
> University of California, San Diego
> (+1) 858.246.0526
> -----------------------------------------------------------------------
>