[BioRuby] Beta application for review: BioRuby - Simple duplication inference implementation

Mon Mar 29 23:32:12 UTC 2010

Hi, Jure:

Your application seems to be on the right track.

In general, your timetable needs to be more detailed.
For each step you should list:
1. Goal/deliverable (you have that)
2. Approach
3. Time estimate (you have that)
4. Anticipated problems & possible alternative approaches

Some more comments:

> 
> *The idea:*
> 
> We would implement the simple and fast duplication inference algorithm 
> described by Zmasek and Eddy (Zmasek and Eddy, 2001, "A simple algorithm 
> to infer gene duplication and speciation events on a gene tree"). Finding 
> gene duplications is an extremely important part of bioinformatics and 
> biomedical research, as duplications are thought to be powerful drivers 
> in the evolution of new protein function. 

I think 'extremely important part of bioinformatics' is somewhat of an 
exaggeration and too vague. It would be better to describe how gene 
duplications complicate gene function prediction, and their significance 
in (the theory of) molecular evolution.

> It is thus important to find 
> gene duplication sequences, which when translated are more likely to be 
> functionally different, and distinguish them from gene speciation 
> sequences, which are more likely functionally equivalent.

'gene duplication sequences' should be 'genes related by a duplication' 
or similar.
'gene speciation sequences' should be 'genes related by a speciation' or 
similar.

> Currently the algorithm supports rooted fully binary trees and we would 
> like to change that, by also implementing support for unrooted and 
> non-binary trees.

The goals should be:
1. Implement algorithm as it is
2. Allow rooting of unrooted gene trees by minimizing sum of duplications.
Optional:
3. Extend algorithm to work on non-binary species trees
4. Extend algorithm to work on non-binary gene trees
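
For goal 1, the core of the algorithm on rooted, fully binary trees can be
sketched roughly as below. This is only a minimal illustration, not the
forester implementation and not the BioRuby tree API: the Node class and the
helpers number_preorder, leaves and sdi are hypothetical names made up for
this example. It assumes species-tree nodes are numbered in preorder and that
every gene-tree leaf carries the name of a species-tree leaf.

```ruby
# Hypothetical minimal tree node; this is NOT the BioRuby Bio::Tree API.
class Node
  attr_accessor :name, :species, :children, :parent, :id, :m, :event

  def initialize(name, children = [], species: nil)
    @name = name
    @species = species
    @children = children
    children.each { |c| c.parent = self }
  end

  def leaf?
    children.empty?
  end
end

# Number species-tree nodes in preorder, so that a parent always has a
# smaller id than any of its descendants.
def number_preorder(node, counter = [0])
  node.id = counter[0]
  counter[0] += 1
  node.children.each { |c| number_preorder(c, counter) }
end

def leaves(node)
  node.leaf? ? [node] : node.children.flat_map { |c| leaves(c) }
end

# Postorder traversal of a rooted binary gene tree. Sets node.m (the
# species-tree node the gene node maps to) and node.event, and returns
# the number of duplications in the subtree.
def sdi(gene, species_leaf)
  if gene.leaf?
    gene.m = species_leaf.fetch(gene.species)
    return 0
  end
  dups = gene.children.sum { |c| sdi(c, species_leaf) }
  a, b = gene.children.map(&:m)
  # Climb from whichever node has the larger preorder id until both meet.
  until a.equal?(b)
    if a.id > b.id
      a = a.parent
    else
      b = b.parent
    end
  end
  gene.m = a
  # Duplication if the node maps to the same species node as a child.
  is_dup = gene.children.any? { |c| c.m.equal?(a) }
  gene.event = is_dup ? :duplication : :speciation
  dups + (is_dup ? 1 : 0)
end

# Example: species tree ((A,B),C), gene tree ((a1,b1),(a2,c1)).
sp_root = Node.new("root", [Node.new("AB", [Node.new("A"), Node.new("B")]),
                            Node.new("C")])
number_preorder(sp_root)
species_leaf = leaves(sp_root).map { |n| [n.name, n] }.to_h

left  = Node.new("g1", [Node.new("a1", species: "A"), Node.new("b1", species: "B")])
right = Node.new("g2", [Node.new("a2", species: "A"), Node.new("c1", species: "C")])
gene_root = Node.new("g0", [left, right])

puts sdi(gene_root, species_leaf) # number of inferred duplications
puts gene_root.event
```

Keeping the mapping and the event on each gene node makes it easy to write
unit tests against small trees with known events before moving on to the
other goals.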

> 
> *The work:*
> 
> There are several milestones to be reached in developing this idea and 
> this is the work plan I propose:
> 1. Development of unit tests with known species and gene trees (1 week).
> 
> 2. Making or reusing necessary data structures, made easier by last 
> year's GSoC contribution implementing phyloXML in BioRuby (1/2 week - 1 
> week):
> - gene tree,
> - species tree,
> - tree node,
> - children(),
> - parent().
> 
> 3. Developing checks for the correctness of input data for rooted fully 
> binary trees SDI (1/2 week - 1 week):
> - making sure trees are rooted and binary,
> - all species/gene tree nodes have at least one type of taxonomic data,
> - making a taxonomy base from a type of data present in all nodes 
> (scientific or common name, taxonomy code, id),
> - making sure taxonomic data is unique throughout external nodes.
> 4. Implementation of the recursive M function (1 week)
> - traverse the gene tree in postorder (left subtree, right subtree, root),
> - finding occurrences where M(parent) equals M(child 1 or 2) - this is 
> representative for finding a duplication. If M(parent) matches neither, 
> the processed node is a speciation.
> 
> 5. Milestone - finished implementation of SDI for rooted fully binary 
> trees (1/2 week):
> - Extensive testing,
> - cleaning up.
> 
> 6. Working on unrooted non-binary trees implementation (4-8 weeks):
> - Look to the forester java library SDI module for insight (by the 
> mentor of this project, Zmasek),
> - Doing some heavy lifting,
> - at this point I consider this implementation a possible pitfall, 
> because of substantially increased complexity.

This needs to be much more detailed.
Species trees are always rooted.
Unrooted gene trees can be handled naively by rooting them in all 
possible places, and running the SDI algorithm on each differently 
rooted tree, and keeping the gene tree which has the lowest number of 
duplications.
A more efficient approach for this is described in:
Zmasek and Eddy (2002). RIO: analyzing proteomes by automated 
phylogenomics using resampled inference of orthologs. BMC 
Bioinformatics. 2002 May 16;3:14.
See: 
http://evogsoc2010.wordpress.com/2010/03/25/references-for-gene-duplications-proposal/
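
The naive rooting approach mentioned above could be sketched as follows. This
is a minimal illustration under stated assumptions, not the more efficient
method from the RIO paper: the unrooted binary gene tree is given as an
adjacency hash, the species tree as a child-to-parent hash, and all names
(depth, lca, combine, map_subtree, best_rooting) are hypothetical. Placing
the root on an edge means the two ends of that edge become the root's two
children.

```ruby
# Depth of species-tree node s, given a child => parent hash (root => nil).
def depth(parent, s)
  d = 0
  d += 1 while (s = parent[s])
  d
end

# Lowest common ancestor: repeatedly climb from the deeper node.
def lca(parent, a, b)
  until a == b
    if depth(parent, a) >= depth(parent, b)
      a = parent[a]
    else
      b = parent[b]
    end
  end
  a
end

# Combine the [mapping, duplications] results of two child subtrees.
def combine(parent, results)
  (m1, d1), (m2, d2) = results
  m = lca(parent, m1, m2)
  dup = (m == m1 || m == m2) ? 1 : 0
  [m, d1 + d2 + dup]
end

# [mapping, duplications] for the subtree at `node`, entered from `from`.
def map_subtree(adj, species_of, parent, node, from)
  kids = adj[node] - [from]
  return [species_of.fetch(node), 0] if kids.empty?
  combine(parent, kids.map { |k| map_subtree(adj, species_of, parent, k, node) })
end

# Try the root on every edge; return [edge, duplications] with the fewest
# duplications (ties are resolved arbitrarily).
def best_rooting(adj, species_of, parent)
  edges = adj.flat_map { |u, vs| vs.map { |v| [u, v] } }.select { |u, v| u < v }
  scored = edges.map do |u, v|
    sides = [map_subtree(adj, species_of, parent, u, v),
             map_subtree(adj, species_of, parent, v, u)]
    [[u, v], combine(parent, sides)[1]]
  end
  scored.min_by { |_, dups| dups }
end

# Example: species tree ((A,B),C); unrooted gene tree with internal nodes
# x and y, where x joins a1 and b1, and y joins a2 and c1.
species_parent = { "A" => "AB", "B" => "AB", "AB" => "root",
                   "C" => "root", "root" => nil }
adj = { "a1" => ["x"], "b1" => ["x"], "a2" => ["y"], "c1" => ["y"],
        "x" => ["a1", "b1", "y"], "y" => ["a2", "c1", "x"] }
species_of = { "a1" => "A", "b1" => "B", "a2" => "A", "c1" => "C" }

edge, dups = best_rooting(adj, species_of, species_parent)
puts "root on edge #{edge.join('-')}: #{dups} duplication(s)"
```

Each rooting recomputes every subtree mapping from scratch, so the total work
grows roughly with the square of the number of gene-tree nodes; that
redundancy is what a more efficient method would avoid.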

> 
> 7. Finishing up (1 week):
> - Extensive testing,
> - cleaning up.
> 
> *Why me?:*
> 
> I like to set foot on unknown territory and challenge myself constantly. 
> That being said, I have long searched for something that would connect 
> my love of medicine to my love of programming, and now, thanks to GSoC 
> and OBF, I think I found it - bioinformatics. I am at a stage of my 
> medical study, where I have to decide what my future will entail, and I 
> am (now, after thinking about it for a long time) positive that 
> bioinformatics will be a big part of it. What better way to get future 
> off to a good start, than with a Google Summer of Code project? Based on 
> this enthusiasm alone you can be assured that I'll work really hard on 
> this project and that I will be happy to see it done. As this would be 
> my first serious open source engagement, you also have a chance of 
> forming a completely new addition to the open source world and making an 
> excellent contributor out of me.
> 
> *Previous experience:*
> 
> 1. I have been working on a simulation of an analytical chemistry method 
> for the past 2 years now, more specifically we have modeled laser 
> ablation + inductively coupled plasma mass spectrometry with a simple 
> model, which aids our elemental mapping projects. For the write-up of 
> this project I have been awarded with a "Prešernovo priznanje" in 2008 
> (PDF upon request). This work entails several interesting components, 
> from basics such as: C# development, image input, output, multi-threaded 
> programming, UI development; to complex themes such as: genetic 
> algorithms and neural networks. All of which I learned as we worked on 
> the project without much hassle (source code upon request). This work is 
> not yet open source, because we are in the finalizing stages of the 
> paper and will release the source code after publication under an open 
> source license.  
> 
> 2. I have programmed since I was a child and I have developed a wide 
> spectrum of things in my lifetime (from a full CMS in PHP to an IRC 
> robot, source code upon request), but I have little experience in fully 
> open source projects, which I think so highly of.
> 
> *Biography:*
> 
> My name is Jure Triglav and I'm a 24 year old medical student from 
> Ljubljana, Slovenia. I was born in a small town of Murska Sobota in 
> Slovenia, where I went to grade school (graded excellent for all years, 
> awarded "Zoisova štipendija" for the gifted, which I still hold) and 
> high-school (excellent, finished as "Zlati maturant" in the company of 
> about 200 best students in the country). I moved to Ljubljana in 2004 to 
> study medicine. I am now in the last year of my medical study which I 
> find challenging and very interesting. 
> My hobbies are all over the place, from book design to photography, from 
> web design to typography, from guitar to poetry, from reading to 
> programming, from traveling to sports. 
> 
>   
> 
> *Other obligations for the summer:*
> 
> I have 5-hour daily clinical practice every weekday in June, July and 
> August, which is not nearly as serious as it sounds, especially since 
> this is the summer rotation which is known for its laid back feel. These 
> practice sessions start at 8 am and finish at 1 pm, and for students are not 
> really stressful or exhausting at all. I have in the past juggled many 
> research obligations with clinical practice and my studies without 
> hiccups, but I will not do this this summer and will dedicate 8 hours 
> daily to Google Summer of Code, as I realize what a great opportunity 
> this is and how much work is required. I have no other work, research or 
> vacation obligations for the period of Google Summer of Code.

Nevertheless, this sounds like a serious concern.

> 
> *Contact information: *
> 
> (I will provide additional contact information in the final application)
> Name: Jure Triglav
> E-mail: juretriglav at gmail.com
> IRC handle: x` on #obf-soc, #gsoc
>