[Bioperl-l] Summer of Code Proposal

Wed Apr 6 18:06:17 UTC 2011

Hi all,

Below is my GSoC 2011 proposal, which describes my plan and thoughts.
As time is really tight now, I need your advice to make it more realistic
and reasonable.
I appreciate your time in reviewing it.
I am also looking for a mentor who is interested in this project and willing
to guide me through the summer.

Best
Fei

PS: Thanks to Chris Fields for the valuable suggestion.

Name     Fei HU
Address  Rm. 3D-11, Swearingen Engineering Building, University of South
Carolina
Email      hufeiyc at gmail.com

Why you are interested in the project you are proposing and are well-suited
to undertake it.
I like to use Perl to organize and automate analysis pipelines, from
extracting data, to running various packages, to analyzing the results. I
would like more people to know its virtues and make use of it. BioPerl
provides a perfect platform for this.
My current research is on gene order phylogeny reconstruction under the
maximum likelihood (ML) criterion (other approaches are MP- and NJ-based). My
phylogeny inference pipeline uses RAxML to build an ML tree and PAML to
estimate the internal (ancestral) sequences. While baseml from PAML is well
supported in BioPerl, RAxML is not yet available. Although I wrote my own
wrapper for RAxML, it would be even better for BioPerl to wrap RAxML so that
everyone can use it easily.
I have used RAxML extensively and also modified its source to adapt it to the
analysis of gene order data. With a good understanding of Perl and RAxML and,
what is more, the willingness to make BioPerl better, I am prepared to
undertake this project.
Programs or projects you have previously authored or contributed to
I implemented the algorithm of [1] in Perl (open source). I also use and
learn Perl on a daily basis.
A project plan for the project you are proposing
The wrapper should be consistent in style and API with the existing packages
supported by Tools::Run. I plan for it to cover the most popular
functionality that RAxML currently provides.
1. Binary sequence analysis (0-1, binary characters) and multi-state
sequence analysis (0-9A-V, 32 characters; available models are ORDERED, MK,
and GTR). This is useful for morphological data.
2. DNA analysis and amino acid analysis, with a custom transition matrix (AA
only) and rate heterogeneity.
3. Standard bootstrapping and rapid bootstrapping, the final thorough
inference [2], and the relatively new bootstopping.
4. A user-supplied starting tree or incomplete constraint tree.
5. A column weight file name to assign individual weights to each
column of the alignment.
6. An exclude file name that contains a specification of the alignment
positions you wish to exclude.
7. Automatic generation of a random seed for the program.
8. And more to be added.
Other plans that may benefit users.
1. Call Bio::AlignIO to parse alignments and rewrite them in interleaved or
sequential PHYLIP format so that RAxML can read them.
2. Design a set of more understandable options, for example:
use “--model” instead of “-P” to specify a custom model file;
use “--workingdir” instead of “-w” to specify the working directory.
Users can still use the original short options if they prefer.
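The alias idea can be sketched as a simple lookup table. This is only an illustration, written in Python for brevity (the wrapper itself would be Perl); the table name and function name are hypothetical, and only the two option pairs mentioned above are included.

```python
# Hypothetical alias table: long names are the proposed ones,
# short flags are the RAxML options named in this proposal.
LONG_TO_SHORT = {
    "--model": "-P",       # custom model file
    "--workingdir": "-w",  # working directory
}


def translate_args(args):
    """Replace known long options with RAxML short flags.

    Anything that is not a known long option (short flags, values)
    is passed through unchanged, so the old style keeps working.
    """
    return [LONG_TO_SHORT.get(arg, arg) for arg in args]
```

With this table, a call such as translate_args(["--workingdir", "/tmp/run", "-n", "T1"]) would hand RAxML the original "-w" flag while leaving "-n" untouched.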
3. Implement more sophisticated exception handling and a summary of the
running mode.
A huge number of argument combinations can cause errors. For example, to
enable rapid bootstrapping plus a thorough inference, one needs to give
“-f a” and “-x {random seed}” together with the number of replicates “-#
{number}”. If any of them is missing, RAxML won’t report at once that all
three are necessary; instead it usually reports only the “nearest” error
it can spot. In my plan, if one wants to conduct rapid bootstrapping (RBS)
plus inference, the wrapper will inform the user that all three are necessary
and then guide the user to correct the call. In sum, I plan to dig the errors
out of the source code and group them according to their functionality, so
that each error message will no longer be independent.
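A minimal sketch of such grouped checking, using the rapid bootstrapping example above. It is written in Python only for illustration; the mode name, table layout, and function name are assumptions, not part of RAxML or BioPerl.

```python
# Each running mode lists every flag it needs, so that all missing
# flags are reported together instead of one "nearest" error at a time.
REQUIRED_BY_MODE = {
    "rapid_bootstrap_with_inference": {
        "-f": "algorithm selector (must be 'a')",
        "-x": "random seed for rapid bootstrapping",
        "-#": "number of bootstrap replicates",
    },
}


def check_mode(mode, params):
    """Return a list of problems for the given mode; empty if all is fine."""
    problems = []
    for flag, meaning in REQUIRED_BY_MODE[mode].items():
        if flag not in params:
            problems.append(f"missing {flag}: {meaning}")
    # A present but wrong -f value is reported separately from a missing one.
    if mode == "rapid_bootstrap_with_inference" and params.get("-f", "a") != "a":
        problems.append("-f must be 'a' for rapid bootstrapping plus inference")
    return problems
```

Given only {"-f": "a"}, this returns two messages at once, one for "-x" and one for "-#", which is the behavior described above.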
Another “trivial” thought: when the run ID already exists, RAxML exits
directly without offering a choice. This is annoying when overwriting is fine.
I suggest using a switch to define the behavior (overwrite, add a suffix
to the name, exit, or skip this run).
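The switch could behave roughly as follows. This Python sketch is an illustration only: the policy names and the function are hypothetical, and it assumes that an existing run is detected by a file named RAxML_info.{run ID} in the working directory and that a run's output files match RAxML_*.{run ID}.

```python
import glob
import os


def resolve_run_id(workdir, run_id, policy="exit"):
    """Return the run ID to use, or None if this run should be skipped.

    policy is one of 'overwrite', 'suffix', 'exit', 'skip'
    (illustrative names for the four behaviors proposed above).
    """
    def exists(rid):
        # Assumption: RAxML_info.<run ID> marks an existing run.
        return os.path.exists(os.path.join(workdir, f"RAxML_info.{rid}"))

    if not exists(run_id):
        return run_id
    if policy == "overwrite":
        # Remove the old output files so RAxML can reuse the ID.
        pattern = os.path.join(glob.escape(workdir),
                               f"RAxML_*.{glob.escape(run_id)}")
        for path in glob.glob(pattern):
            os.remove(path)
        return run_id
    if policy == "suffix":
        # Try run_id_1, run_id_2, ... until an unused name is found.
        n = 1
        while exists(f"{run_id}_{n}"):
            n += 1
        return f"{run_id}_{n}"
    if policy == "skip":
        return None
    if policy == "exit":
        raise FileExistsError(f"run ID '{run_id}' already exists in {workdir}")
    raise ValueError(f"unknown policy: {policy}")
```

The default stays "exit" so that the wrapper behaves like plain RAxML unless the user asks for something else.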
4. Preliminary post-processing can be conducted and the results returned as a
value or list: output the maximum likelihood score for each bootstrapped
tree; enumerate branches that have a confidence value larger than a
threshold; return a hash table containing branch lengths, running time, and
the final ML score. More analysis can be done by other packages anyway.
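As an illustration of the branch enumeration step, here is a small Python sketch. It assumes the tree is a Newick string in which support values are stored as numeric labels on internal nodes; other labeling styles (for example bracketed branch labels) are not handled, and the function name is hypothetical.

```python
import re


def supported_clades(newick, threshold):
    """Return (sorted leaf names, support) for each internal branch
    whose support value is larger than the threshold."""
    tokens = re.findall(r"[(),;]|[^(),;]+", newick.strip())
    stack = []        # one list of leaf names per currently open clade
    last_closed = []  # leaves of the clade closed most recently
    result = []
    prev = None
    for tok in tokens:
        if tok == "(":
            stack.append([])
        elif tok == ")":
            last_closed = stack.pop()
            if stack:
                # The parent clade contains all leaves of this clade.
                stack[-1].extend(last_closed)
        elif tok not in ",;":
            label = tok.split(":")[0].strip()
            if prev == ")":
                # A label right after ')' is the support of that clade.
                if label and float(label) > threshold:
                    result.append((sorted(last_closed), float(label)))
            elif label:
                stack[-1].append(label)
        prev = tok
    return result
```

For example, with the tree "((A:0.1,B:0.2)95:0.05,(C:0.1,D:0.1)60:0.02);" and a threshold of 70, only the clade containing A and B is returned.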

Any obligations, vacations, or plans for the summer that may require
scheduling during the GSoC work period.
No special obligations or vacations.

[1] Hu, F., Gao, N. and Tang, J., "Maximum Likelihood Phylogenetic
Reconstruction Using Gene Order Encodings", CIBCB 2011, accepted.
[2] Stamatakis, A., Hoover, P. and Rougemont, J., "A rapid bootstrap
algorithm for the RAxML web-servers", Syst. Biol. 2008, 57:758–771.