[Bioperl-l] Summer of Code Proposal

Thu Apr 7 14:28:02 UTC 2011

Fei,

A few things.  Most important:

1) You should have a rough timeline (with actual dates) for the summer project, based on Google's events calendar (http://www.google-melange.com/gsoc/events/google/gsoc2011).  This should include start of coding as well as some general timeline of how you plan on implementing your wrappers and other related code.  

2) 'Deliverables' are needed.  How does BioPerl benefit from this?  What do we get as a result of this endeavor?

A few comments on the proposal:

1) Wrappers, by themselves, aren't necessarily difficult to write up.  The tough part is getting Bio::* objects to work with the wrapped executable and parsing output, all the while ensuring the current classes within BioPerl can deal with the data in a meaningful way.  I haven't seen that described.

2) How would you want to deal with very large data sets using SeqIO?   Or would it be better to use something like an indexed flatfile, or seqs stored in a database?

3) How do you plan on dealing with multi-state or binary state data?  I don't think there are classes that handle this data (yet), or handle it well without significant hackery.  Hint: maybe that can be rectified...
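
To make the indexed-flatfile option in (2) concrete, here is a minimal sketch of the idea. It is in Python purely for illustration and is not BioPerl code; the function names are made up for this example. BioPerl already ships indexing modules (Bio::DB::Fasta, Bio::Index::Fasta) that a wrapper would more likely build on. The point is only that one pass records a byte offset per record, after which any sequence can be fetched without loading the whole file.

```python
def build_fasta_index(path):
    """Map each record ID to the byte offset of its '>' header line."""
    index = {}
    with open(path, "rb") as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                fields = line[1:].split()
                if fields:
                    # The ID is the first whitespace-delimited token.
                    index[fields[0].decode("ascii")] = offset
    return index


def fetch_sequence(path, index, seq_id):
    """Seek straight to one record and read only that record's lines."""
    chunks = []
    with open(path, "rb") as handle:
        handle.seek(index[seq_id])
        handle.readline()  # skip the header line
        for line in handle:
            if line.startswith(b">"):
                break  # reached the next record
            chunks.append(line.strip())
    return b"".join(chunks).decode("ascii")
```

Memory use then scales with the number of records rather than the total sequence length, which is the trade-off against reading everything through a sequential parser.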

chris

On Apr 7, 2011, at 8:08 AM, Fei Hu wrote:

> Messina :
> 
> I corrected some writing mistakes.
> I also added a whole new section about RAxML, comparing it to
> other tools.
> Thank you so much.
> 
> Best
> Fei
> 
> On Thu, Apr 7, 2011 at 4:51 AM, Dave Messina <David.Messina at sbc.su.se> wrote:
> 
>> Hi,
>> 
>> Looking pretty good, particularly the project plan section.
>> 
>> You might also add some text in your introduction which shows the
>> importance of RAxML. Say that it's widely used and demonstrate that with
>> the number of citations, number of downloads, or similar data.
>> 
>> Also, there are some small English mistakes (for example wrap instead of
>> wrapper, provide instead of provides), so ask a native English speaker to do
>> some editing.
>> 
>> Good luck! I'd love to see this happen.
>> 
>> Dave
>> 
>> 
>> On Apr 6, 2011, at 20:06, Fei Hu <hufeiyc at gmail.com> wrote:
>> 
>>> Hi all,
>>> 
>>> Below is my GSoC 2011 proposal that describes my plan and thoughts.
>>> As time is really tight now, I need your advice to make it more realistic
>>> and reasonable.
>>> I appreciate your time for reviewing.
>>> Also I am looking for a mentor who is interested in this project and
>>> willing to guide me through the summer.
>>> 
>>> Best
>>> Fei
>>> 
>>> PS: Thanks Chris Fields for your valuable suggestion.
>>> 
>>> 
>>> Name     Fei HU
>>> Address  Rm. 3D-11, Swearingen Engineering Building, University of South
>>> Carolina
>>> Email      hufeiyc at gmail.com
>>> 
>>> Why you are interested in the project you are proposing and are
>>> well-suited to undertake it.
>>> I like to use Perl to organize and automate pipelines, from extracting
>>> data and running various packages to analyzing results. I would like
>>> more people to know its virtues and make use of it. BioPerl provides us
>>> a perfect platform.
>>> My current research is about gene order phylogeny reconstruction
>>> following the maximum likelihood criterion (others include MP- and
>>> NJ-based methods). My phylogeny inference pipeline involves using RAxML
>>> to build an ML tree and estimating the internal (ancestral) sequences
>>> using PAML. While baseml of PAML is well supported, RAxML is not yet
>>> available. Although I wrote my own wrap for RAxML, it would be even
>>> better for BioPerl to wrap RAxML so that everyone can use it easily.
>>> I have used RAxML extensively and also modified its source to analyze
>>> gene order data. With a good understanding of Perl and RAxML and,
>>> what’s more, the willingness to make BioPerl better, I am prepared to
>>> undertake it.
>>> Programs or projects you have previously authored or contributed to
>>> I implemented the algorithm using Perl [1] (open source). I also use
>>> and learn Perl on a daily basis.
>>> A project plan for the project you are proposing
>>> The wrap should be consistent in style and API with the other existing
>>> packages supported by Tools::Run. I plan for it to cover the most popular
>>> functionality RAxML currently provide.
>>> 1. Binary sequence analysis (0-1, binary characters) and multi-state
>>> sequence analysis (0-9A-V, 32 characters; available models are ORDERED,
>>> MK, GTR). This is useful for morphological data.
>>> 2. DNA analysis and amino acid analysis, with a custom transition matrix
>>> (AA only) and rate heterogeneity.
>>> 4. Conduct standard bootstrapping and rapid bootstrapping, as well as the
>>> final thorough inference [2] and the relatively new bootstopping.
>>> 5. Accept a user starting tree or an incomplete constraint tree.
>>> 6. Specify a column weight file name to assign individual weights to each
>>> column of the alignment.
>>> 7. Specify an exclude file name that contains a specification of
>>> alignment positions you wish to exclude.
>>> 8. Automatically generate a random seed for the program.
>>> 9. And more to be added.
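
On item 1 of this list: one way the binary versus multi-state distinction could be checked before anything is handed to RAxML is sketched below. It is in Python for illustration only, not BioPerl code. The state alphabets (0-1 and 0-9A-V) are taken from the proposal; the function name and the treatment of "-" and "?" as missing-data symbols are assumptions made for this example.

```python
import string

BINARY_STATES = frozenset("01")
# 0-9 plus A-V gives the 32 states named in the proposal.
MULTISTATE_STATES = frozenset(string.digits + "ABCDEFGHIJKLMNOPQRSTUV")
# Assumption for this sketch: gap and unknown symbols are skipped.
MISSING_SYMBOLS = frozenset("-?")


def classify_characters(rows, missing=MISSING_SYMBOLS):
    """Return 'binary' or 'multistate' for a {taxon: states} mapping.

    Raises ValueError, naming the taxon and column, on any symbol
    outside the 32-state alphabet.
    """
    seen = set()
    for taxon, states in rows.items():
        for column, symbol in enumerate(states, start=1):
            symbol = symbol.upper()
            if symbol in missing:
                continue
            if symbol not in MULTISTATE_STATES:
                raise ValueError(
                    f"invalid state {symbol!r} for {taxon} at column {column}"
                )
            seen.add(symbol)
    return "binary" if seen <= BINARY_STATES else "multistate"
```

A dedicated class holding such character matrices would give the wrapper a single place to do this check and to choose the matching model.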
>>> Other plans that may benefit users.
>>> 1. Call Bio::SeqIO to parse and reconstruct interleaved or sequential
>>> phylip format so that RAxML can read it.
>>> 2. Design a set of more understandable commands, such as
>>> use “--model” instead of “-P” to specify a custom model file.
>>> use “--workingdir” instead of “-w” to specify the working directory.
>>> One can still use the old style according to their own preference.
>>> 3. Implement a more sophisticated exception handler and running mode
>>> summary.
>>> There is a huge number of argument combinations that can cause errors.
>>> For example, to enable rapid bootstrapping plus a thorough inference, one
>>> needs to give “-f a” and “-x {random seed}” together with the number of
>>> replicates “-# {number}”. If any one is missing, RAxML won’t tell at once
>>> that all three are necessary; instead RAxML usually can only tell the
>>> “nearest” error it can spot. In my plan, if one wants to conduct RBS
>>> (rapid bootstrapping) plus inference, the wrap is able to inform the user
>>> that all three are necessary and then guide them to correct it. In sum,
>>> I plan to dig the errors out of the source code and group them according
>>> to their functionality, so each error message will no longer be
>>> independent.
>>> Another “trivial” thought: when the run ID already exists, RAxML exits
>>> directly without offering a choice. This is disturbing if overwriting is
>>> fine, so I suggest using a switch to define the behavior (overwrite, add
>>> a postfix to the name, exit, or skip this run).
>>> 4. Preliminary post-processing can be conducted and the result returned
>>> as a value or list: output the maximum likelihood score for each
>>> bootstrapped tree; enumerate branches that have a confidence value larger
>>> than a threshold; return a hash table containing branch lengths, running
>>> time, and final ML score. More analysis could be done by other packages
>>> anyway.
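
The grouped error reporting and the long option names described in items 2 and 3 could look roughly like the sketch below. It is in Python for illustration only and is not part of any existing wrapper. The flags (-f a, -x, -#, -P, -w) and the long names come from the proposal; the mode name "rapid_bootstrap_ml" and the function names are hypothetical.

```python
# Long option names from the proposal, mapped to the original short flags.
LONG_TO_SHORT = {"--model": "-P", "--workingdir": "-w"}

# Flags that must appear together for each run mode. A value of None
# means any value is accepted; otherwise the value must match exactly.
# The mode name is hypothetical, chosen for this sketch.
REQUIRED_BY_MODE = {
    "rapid_bootstrap_ml": {"-f": "a", "-x": None, "-#": None},
}


def normalize_options(options):
    """Translate long option names to short flags; keep the rest as-is."""
    return {LONG_TO_SHORT.get(name, name): value
            for name, value in options.items()}


def missing_for_mode(mode, options):
    """Return every problem for the mode at once, not just the first."""
    opts = normalize_options(options)
    problems = []
    for flag, expected in REQUIRED_BY_MODE[mode].items():
        if flag not in opts:
            problems.append(f"{mode} requires {flag}")
        elif expected is not None and opts[flag] != expected:
            problems.append(
                f"{mode} requires {flag} {expected}, got {flag} {opts[flag]}"
            )
    return problems
```

Calling missing_for_mode("rapid_bootstrap_ml", {"-f": "a"}) reports both the missing -x and the missing -# in one pass, which is the behaviour the proposal asks for in place of a single "nearest" error.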
>>> 
>>> Any obligations, vacations, or plans for the summer that may require
>>> scheduling during the GSoC work period.
>>> No special obligations or vacations.
>>> 
>>> 
>>> [1] Hu, F., Gao, N. and Tang, J., "Maximum Likelihood Phylogenetic
>>> Reconstruction Using Gene Order Encodings", CIBCB 2011, accepted.
>>> [2] Stamatakis, A., Hoover, P. and Rougemont, J., "A rapid bootstrap
>>> algorithm for the RAxML web-servers", Syst. Biol. 2008, 57:758–771.
>>> 
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>> 
> 
> 
> 
> -- 
> *Fei Hu
> Bioinformatics Lab
> 3D-11 Swearingen Building
> U of South Carolina
> Tel: 803-397-5240*
> 