[BioRuby] Beta application for review: BioRuby - Simple duplication inference implementation

Christian M Zmasek czmasek at burnham.org
Fri Apr 2 22:42:03 UTC 2010

Hi, Jure:

It looks good to me!


Jure Triglav wrote:
> Thank you Christian!
> I am glad that you find it bettered and I will now submit this application to Google, then fine tune it some more until the deadline. Do you have any recommendations as to what could still be improved or added?
> Yes I agree, the extension to non-binary gene trees is the least defined problem of the proposed group of problems, so it is a good idea to somehow point that out and make its solution and implementation optional.
> And yes, the deadline for applying to clinical practice is a few weeks after April 26th, on which day the accepted GSoC proposals will be announced, so don't worry about that.
> Thank you for your reply and help!
> Best regards,
> Jure Triglav
> On Apr 2, 2010, at 8:37 PM, Christian M Zmasek wrote:
>> Hi, Jure:
>> Indeed, you improved it a lot!
>> Clearly, you don't _need_ to discuss 'anticipated problems', if you don't expect any.
>> I probably would point out that:
>> "- Extend the algorithm to support non-binary species tree as described
>> by Vernot in "Reconciliation with non-binary species tree"."
>> and
>> "- Extend the algorithm to support non-binary gene trees as described by
>> Durand in "A Hybrid Micro–Macroevolutionary Approach to Gene Tree
>> Reconstruction"."
>> are optional, especially the second one.
>> Regarding my obligations for the summer, please make sure that you _only_ cancel them if/after this has proposal been approved and you have been approved for it.
>> As you know, less than half of all proposals get accepted in the end, and each proposal has many students applying for it.
>> Christian
>> Jure Triglav wrote:
>>> Hello all!
>>> Thank you for the thorough review Christian! I took your comments seriously and spent the last two days reviewing papers and reading up on various subjects related to the proposal. I've done a lot of work and I think I have refined it substantially. I hope I have now fully elaborated on the problem and proposed time-table. I would like to kindly ask you to review it again and point out any remaining issues.
>>> The only thing from your list of requirements for the proposed time-table that I have trouble with are the anticipated problems and possible alternative approaches, since all of the developed algorithms seem rock solid to me (almost all of them have mathematical proof included) and the only possible issue that I can think of, are the incompatibilities of data structures between various algorithms (which we will address in the first week, as most are very similar) and coding errors (which we will fix, of course! :).
>>> Regarding my obligations for the summer, I would not hesitate to cancel them (or as it is, simply not apply for clinical practice, as I have no obligation yet), seeing as you consider them a serious issue. I am very motivated to do this project and would like to do everything possible to make it happen. Anyway, I can always apply for clinical practice on the next term, it really is not an issue. Best regards to all,
>>> Jure Triglav
>>> *The idea:*
>>> We would implement the simple and fast duplication inference algorithm described by Zmasek and Eddy (Zmasek and Eddy, 2001, "A simple algorithm to infer gene duplication and speciation events on a gene tree". With several billion nucleotides sequenced daily (Edwards, Hansen and Stajich, 2009, "Bioinformatics - Tools and applications"), the determination of protein function is mostly done automatically without human intervention by finding the most similar sequences that already have determined protein function (microevolutionary approach). This way of automatically determining protein function neglects additional available information in the form of macroevolutionary relationships. By inferring these interesting relationships (speciation, duplication) from a species and gene tree using a simple algorithm, we can gain a better understanding of protein function.
>>> The importance of determining these evolutionary relationships stems from a relatively simple assumption, that if two similar genes are thought to be related by speciation, their function is more likely to be similar too. On the other hand, if we determine these two genes to be related by duplications, their function is more likely to be different, as gene duplications are powerful drivers in the evolution of new protein function. This is because the second copy of a gene is often free of selective pressure and accumulates mutations more rapidly than a single copy of a gene.
>>> The original algorithm proposed by Zmasek and Eddy supports rooted fully binary gene and species trees, but we have decided to expand on that scope by implementing support for unrooted gene trees (which are produced by some bioinformatics methods and thus need to be addressed), non-binary species trees (since a lot of species trees are non-binary, i.e. 64% of NCBI nodes have more than 2 children (Vernot, 2007)), and non-binary gene trees (which are also produced by some bioinformatics methods, but represent only uncertainties, as gene trees are inherently binary).
>>> *Goals:*
>>> **
>>> *- *Implement the algorithm as described by Zmasek and Eddy in "A simple algorithm to infer gene duplication and speciation events on a gene tree", or SDI, which is designed to work on rooted gene trees and fully binary gene and species tree. - Allow rooting of unrooted gene trees by minimizing sum of duplications as described by Zmasek and Eddy in "RIO: Analyzing proteomes by automated phylogenomics using resampled inference of orthologs", and thus extending the implementation to support unrooted gene trees.
>>> - Extend the algorithm to support non-binary species tree as described by Vernot in "Reconciliation with non-binary species tree".
>>> - Extend the algorithm to support non-binary gene trees as described by Durand in "A Hybrid Micro–Macroevolutionary Approach to Gene Tree Reconstruction".
>>> *The work:*
>>> Some terminology:
>>> - g: a node in the gene tree
>>> - p(g): the parent of node g in the gene tree
>>> - s: a node in the species tree
>>> - M: a mapping function that links nodes of a gene tree to nodes of a species tree
>>> - roofN: an adapted mapping function for non-binary species trees
>>> - e(p(g),g): the edge between parent of g and g in the gene tree
>>> - polytomy: a species tree node with more than 2 children
>>> There are several milestones to be reached in developing this idea and this is the work plan I propose:
>>> 1. Development of unit tests with known species and gene trees (1 week).
>>> 2. Making or reusing necessary data structures, made easier by last years GSoC contribution implementing phyloXML in BioRuby (1/2 weeks - 1 week):
>>> - gene tree,
>>> - species tree,
>>> - tree node,
>>> - children(),
>>> - parent().
>>> 3. Developing checks for the correctness of input data for rooted fully binary trees SDI (1/2 weeks - 1 week):
>>> - making sure trees are rooted and binary,
>>> - all species/gene tree nodes have at least on type of taxonomic data.
>>> - making a taxonomy base from a type of data present in all nodes (scientific or common name, taxonomy code, id),
>>> - making sure taxonomic data is unique throughout external nodes.
>>> 4. Implementation of the recursive M function (1 week)
>>> - traverse the gene tree in postorder (left subtree, right subtree, root),
>>> - finding occurrences where M(parent) equals M(child 1 or 2) - this is representative for finding a duplication. If M(parent) matches neither, the processed node is a speciation.
>>> 5. Milestone - finished implementation of SDI for rooted fully binary trees (1/2 week):
>>> - Extensive testing,
>>> - polishing and writing documentation with RDoc,
>>> - cleaning up.
>>> 6. Milestone: Implementation of support for unrooted gene trees (1 week):
>>> - implement an algorithm which roots an unrooted gene tree by exploring all possible roots and selecting the one with minimum duplications,
>>> - calculating M is the most intensive step, so we only do it once for one rooted gene tree,
>>> - by moving the root one node at a time, M does not have to be calculated for every node of the gene tree, but only for two nodes:
>>> - first child of previous root, if the new root is on a brach of first child of the previous root,
>>> - second child of previous root, if the new root is on a branch of second child of the previous root,
>>> - and the new root.
>>> - traversing the whole gene tree one node at a time we explore all possible root placements and resulting duplications,
>>> - from a group of trees with a minimal number of duplications, the shortest tree is chosen as the rooted tree, - the algorithm for this is written in pseudocode in "RIO: Analyzing proteomes by automated phylogenomics using resampled inference of orthologs" by Zmasek and Eddy as "Algorithm for speciation duplication inference combined with rooting", and needs to be translated to Ruby code.
>>> 7. Milestone: Implementing an duplication/loss inference algorithm for non-binary species trees (described by Vernot, 2008) (2 weeks):
>>> - implement function roofN(g), which returns all roots of subtrees of the parent of "s" (s = M(g)) in which descendants of g must be present,
>>> - if the intersection of roofN(left child of g) and roofN(right child of g) is not NULL, then g is a required multiplication,
>>> - else if the intersection of roofN(left child of g) and roofN(right child of g) is NULL, then it is impossible to tell whether the event was a duplication or a deep coalescence, the event is thus called a conditional duplication,
>>> - implement function N(g), which returns all of the children of M(g), where descendants of g were present in descendants of each element in N(g), - implementing a prediction of gene loss events by assuming that minimizing gene losses gives the biologically most likely prediction by taking into account the following rules, which minimize explicit loses by predicting each loss as close as possible to the root of the gene tree:
>>> - Binary duplication loss: if p(g) is a required duplication then, if M(p(g)) != M(g) then species in (N(p(g)) without roofN(g)) are lost on edge e between p(g) and g.
>>> - Skipped species loss: if M(g) != M(p(g)) and p(M(g)) != M(p(g)) then we can infer a loss at every skipped species between M(p(g) and M(g).
>>> - Polytomy duplication loss: if M(p(g) is a polytomy and p(g) is a required duplication, then species (N(p(g)) without roofN(g)) are lost at e(p(g),g).
>>> - Polytomy speciation loss: if M(g) != M(p(g) and M(g) is a polytomy, then all children of M(g) should have a descendant of g.
>>> - the algorithm for this is written in pseudocode  in "Reconciliation with non-binary species tree" by Vernot as "Algorithm 5.1", which has to be translated to Ruby code.
>>> 8. Milestone: Implementing support for non-binary gene trees (2 weeks)
>>> - gene trees are by definition binary, but some methods produce uncertainties which result in multifurcating gene trees,
>>> - we can support non-binary gene trees by expanding multifurcating nodes to arbitrary binary trees and then optimizing the generated tree for duplications and losses with a previously developed algorithm
>>>  - this approach is described in "A Hybrid Micro–Macroevolutionary Approach to Gene Tree Reconstruction" by Durand, which contains several pseudocode algorithms that can be ported to Ruby.
>>> 9. Finishing up (1-2 weeks):
>>> - Extensive testing of all implemented algorithms,
>>> - polishing and writing documentation using RDoc,
>>> - cleaning up.
>>> *Why me?:*
>>> I like to set foot on unknown territory and challenge myself constantly. I have long searched for something that would connect my love of medicine to my love of programming, and now, thanks to GSoC and OBF, I think I found it - bioinformatics. I am at a stage of my medical study, where I have to decide what my future will entail, and I am (now, after thinking about it for a long time) positive that bioinformatics will be a big part of it. What better way to get future off to a good start, than with a Google Summer of Code project? Based on this enthusiasm alone you can be assured that I'll work really hard on this project and that I will be happy to see it done. As this would be my first serious open source engagement, you also have a chance of forming a completely new addition to the open source world and making a good contributor out of me.
>>> *Previous experience:*
>>> 1. I have been working on a simulation of an analytical chemistry method for the past 2 years now, more specifically we have modeled laser ablation + inductively coupled plasma mass spectrometry with a simple model, which aids our elemental mapping projects. For the write-up of this project I have been awarded with a "Prešernovo priznanje" in 2008 (PDF upon request). This work entails several interesting components, from basics such as: C# development, image input, output, multi-threaded programming, UI development; to complex themes such as: genetic algorithms and neural networks. All of which I learned as we worked on the project without much hassle (source code upon request). This work is not yet open source, because we are in the finalizing stages of the paper and will release the source code after publication under an open source license.  2. I have programmed since I was a child and I have developed a wide specter of things in my lifetime (from a full CMS in PHP to
 an IRC robot, source code upon request), but I have little experience in fully open source projects, which I think so highly of. *Biography:*
>>> My name is Jure Triglav and I'm a 24 year old medical student from Ljubljana, Slovenia. I was born in a small town of Murska Sobota in Slovenia, where I went to grade school (graded excellent for all years, awarded "Zoisova štipendija" for the gifted, which I still hold) and high-school (excellent, finished as "Zlati maturant" in the company of about 200 best students in the country). I moved to Ljubljana in 2004 to study medicine. I am now in the last year of my medical study which I find challenging and very interesting. My hobbies are all over the place, from book design to photography, from web design to typography, from guitar to poetry, from reading to programming, from traveling to sports.   *Other obligations for the summer:*
>>> In question, will be cancelled upon request: I have 5-hour daily clinical practice every weekday in June, July and August, which is not nearly as serious as it sounds, especially since this is the summer rotation which is known for its laid back feel. These practice start at 8 am and finish at 1 pm, and for students are not really stressful or exhausting at all. I have in the past juggled many research obligations with clinical practice and my studies without hiccups, but I will not do this this summer and will dedicate 8 hours daily to Google Summer of Code, as I realize what a great opportunity this is and how much work is required. I have no other work, research or vacation obligations for the period of Google Summer of Code.
>>> *Contact information: *
>>> (I will provide additional contact information in the final application)
>>> Name: Jure Triglav
>>> E-mail: juretriglav at gmail.com <mailto:juretriglav at gmail.com>
>>> IRC handle: x` on #obf-soc, #gsoc

More information about the BioRuby mailing list