[Bioperl-l] Summer of Code Proposal

Thu Apr 7 08:51:48 UTC 2011

Hi,

Looking pretty good, particularly the project plan section.

You might also add some text in your introduction which shows the importance of RAxML. Say that it's widely used and demonstrate that with the number of citations, number of downloads, or similar data.

Also, there are some small English mistakes (for example wrap instead of wrapper, provide instead of provides), so ask a native English speaker to do some editing.

Good luck! I'd love to see this happen.

Dave

On Apr 6, 2011, at 20:06, Fei Hu <hufeiyc at gmail.com> wrote:

> Hi all,
> 
> Below is my GSoC 2011 proposal, which describes my plan and thoughts.
> As time is really tight now, I need your advice to make it more realistic
> and reasonable.
> I appreciate your time in reviewing it.
> Also I am looking for a mentor who is interested in this project and willing
> to guide me through the summer.
> 
> Best
> Fei
> 
> PS: Thanks Chris Fields for your valuable suggestion.
> 
> 
> Name     Fei HU
> Address  Rm. 3D-11, Swearingen Engineering Building, University of South
> Carolina
> Email      hufeiyc at gmail.com
> 
> Why you are interested in the project you are proposing and are well-suited
> to undertake it.
> I like to use Perl to organize and automate pipelines, from extracting
> data to running various packages and analyzing the results. I would like
> more people to know its virtues and make use of it. BioPerl provides a
> perfect platform for this.
> My current research is on gene-order phylogeny reconstruction under the
> maximum likelihood criterion (other approaches include MP- and NJ-based
> methods). My phylogeny inference pipeline uses RAxML to build an ML tree and
> PAML to estimate the internal (ancestral) sequences. While baseml from PAML
> is well supported, RAxML is not yet available. Although I wrote my own
> wrapper for RAxML, it would be even better for BioPerl to wrap RAxML so that
> everyone can use it easily.
> I have used RAxML extensively and also modified its source so that it can
> analyze gene-order data. With a good understanding of Perl and RAxML and,
> moreover, the desire to make BioPerl better, I am prepared to undertake it.
> Programs or projects you have previously authored or contributed to
> I implemented the algorithm in Perl [1] (open source). I also use and
> learn Perl on a daily basis.
> A project plan for the project you are proposing
> The wrapper should be consistent in style and API with the other existing
> packages supported by Tools::Run. I plan for it to cover the most popular
> functionality that RAxML currently provides.
> 1. Binary sequence analysis (0-1, binary characters) and multi-state
> sequence analysis (0-9A-V, 32 characters; available models are ORDERED, MK,
> GTR), which is useful for morphological data.
> 2. DNA analysis and amino acid analysis, with a custom transition matrix (AA
> only) and rate heterogeneity.
> 4. Conduct standard bootstrapping and rapid bootstrapping, together with the
> final thorough inference[2], as well as the relatively new bootstopping.
> 5. Accept a user-supplied starting tree or an incomplete constraint tree.
> 6. Specify a column weight file name to assign individual weights to each
> column of the alignment.
> 7. Specify an exclude file name that contains a specification of the
> alignment positions you wish to exclude.
> 8. Automatically generate a random seed for the program.
> 9. And more to be added.
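
A minimal sketch of how the options in the list above might be turned into a
RAxML command line, written in Python only for brevity (the wrapper itself
would be a Perl module). The function name build_command and its keyword
names are hypothetical. The flags -s, -n, -m, -p, -a, -E, -t and -g and the
binary name raxmlHPC are assumed to match a standard RAxML install, so check
them against the version you have.

```python
import random


def build_command(alignment, run_name, model="GTRGAMMA", seed=None,
                  weights_file=None, exclude_file=None,
                  starting_tree=None, constraint_tree=None,
                  executable="raxmlHPC"):
    """Assemble a RAxML argument list from keyword options (hypothetical API)."""
    if seed is None:
        # Pick a random seed automatically when the caller gives none.
        seed = random.randint(1, 2**31 - 1)
    cmd = [executable, "-s", alignment, "-n", run_name,
           "-m", model, "-p", str(seed)]
    # Optional inputs, each paired with the flag assumed for it.
    optional = [
        ("-a", weights_file),     # column weight file
        ("-E", exclude_file),     # alignment positions to exclude
        ("-t", starting_tree),    # user starting tree
        ("-g", constraint_tree),  # incomplete constraint tree
    ]
    for flag, value in optional:
        if value is not None:
            cmd += [flag, value]
    return cmd
```

Returning a list rather than a single string keeps file names with spaces
intact if the list is later handed to a process launcher.
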
> Other plans that may benefit users.
> 1. Call Bio::SeqIO to parse and rewrite interleaved or sequential PHYLIP
> format so that RAxML can read it.
> 2. Design a set of more understandable commands, such as
> use “--model” instead of “-P” to specify a custom model file.
> use “--workingdir” instead of “-w” to specify the working directory.
> Users can still use the old style if they prefer.
> 3. Implement more sophisticated exception handling and a running-mode
> summary. A huge number of argument combinations can cause errors. For
> example, to enable rapid bootstrapping plus a thorough inference, one needs
> to give “-f a” and “-x {random seed}” together with the number of replicates
> “-# {number}”. If any one of them is missing, RAxML won’t report at once that
> all three are necessary; instead it usually reports only the “nearest” error
> it can spot. In my plan, if one wants to conduct RBS plus inference, the
> wrapper will inform the user that all three are necessary and guide them to
> correct the call. In sum, I plan to dig the errors out of the source code and
> group them according to their functionality, so that error messages are no
> longer independent of one another.
> Another “trivial” thought: when the run ID already exists, RAxML exits
> directly without offering a choice. This is disruptive when overwriting is
> fine, so I suggest a switch to define the behavior (overwrite, add a postfix
> to the name, exit, or skip this run).
> 4. Preliminary post-processing can be conducted and the results returned as
> a value or list: output the maximum likelihood score for each bootstrapped
> tree; enumerate branches that have a confidence value larger than a
> threshold; return a hash table containing branch lengths, running time, and
> the final ML score. More analysis could be done by other packages anyway.
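
The long option names and the combined check for rapid bootstrapping
described in the list above could be sketched roughly as follows, in Python
only for brevity. Only “--model” and “--workingdir” come from the proposal;
the other long names, LONG_TO_SHORT, normalize and check_rapid_bootstrap are
hypothetical, and the rule encoded is just the one stated in the proposal
(“-f a” needs both “-x” and “-#”).

```python
# Hypothetical long option names mapped to RAxML's short flags.
LONG_TO_SHORT = {
    "--model": "-P",           # custom model file (from the proposal)
    "--workingdir": "-w",      # working directory (from the proposal)
    "--algorithm": "-f",       # hypothetical long name
    "--bootstrap-seed": "-x",  # hypothetical long name
    "--replicates": "-#",      # hypothetical long name
}


def normalize(options):
    """Return a dict keyed by short flags, accepting either style."""
    return {LONG_TO_SHORT.get(key, key): value
            for key, value in options.items()}


def check_rapid_bootstrap(options):
    """Report every option still missing for "-f a", all at once."""
    opts = normalize(options)
    if opts.get("-f") != "a":
        return []  # not a rapid bootstrap plus ML search run
    missing = []
    if "-x" not in opts:
        missing.append("-x {random seed} is required with -f a")
    if "-#" not in opts:
        missing.append("-# {number} of replicates is required with -f a")
    return missing
```

Collecting the messages in a list, instead of stopping at the first problem,
is what lets the wrapper tell the user about all related options in one go.
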
> 
> Any obligations, vacations, or plans for the summer that may require
> scheduling during the GSoC work period.
> No special obligations or vacations.
> 
> 
> [1] Hu, F., Gao, N. and Tang, J., "Maximum Likelihood Phylogenetic
> Reconstruction Using Gene Order Encodings", CIBCB 2011, accepted.
> [2] Stamatakis, A., Hoover, P. and Rougemont, J., "A Rapid Bootstrap
> Algorithm for the RAxML Web Servers", Syst. Biol. 2008, 57:758–771.
> 