[Bioperl-l] Summer of Code Proposal

Thu Apr 7 16:23:46 UTC 2011

Hi Chris

I have added a lot to the proposal in response to your valuable comments,
including a timeline, the benefits to BioPerl, the Bio::Seq and SeqIO concerns,
and how the wrapper can work with other objects.
Thank you!

Best
Fei

On Thu, Apr 7, 2011 at 10:28 AM, Chris Fields <cjfields at illinois.edu> wrote:

> Fei,
>
> A few things.  Most important:
>
> 1) You should have a rough timeline (with actual dates) for the summer
> project, based on Google's events calendar (
> http://www.google-melange.com/gsoc/events/google/gsoc2011).  This should
> include start of coding as well as some general timeline of how you plan on
> implementing your wrappers and other related code.
>
> 2) 'Deliverables' are needed.  How does BioPerl benefit from this?  What do
> we get as a result of this endeavor?
>
> A few comments on the proposal:
>
> 1) Wrappers, by themselves, aren't necessarily difficult to write up.  The
> tough part is getting Bio::* objects to work with the wrapped executable and
> parsing output, all the while ensuring the current classes within BioPerl
> can deal with the data in a meaningful way.  I haven't seen that described.
>
> 2) How would you want to deal with very large data sets using SeqIO?   Or
> would it be better to use something like an indexed flatfile, or seqs stored
> in a database?
>
> 3) How do you plan on dealing with multi-state or binary state data?  I
> don't think there are classes that handle this data (yet), or handle it well
> w/o significant hackery.  hint: maybe that can be rectified...
>
> chris
>
> On Apr 7, 2011, at 8:08 AM, Fei Hu wrote:
>
> > Messina :
> >
> > I corrected some writing mistakes.
> > I also added a whole new section describing RAxML and comparing it to
> > others.
> > Thank you so much.
> >
> > Best
> > Fei
> >
> > On Thu, Apr 7, 2011 at 4:51 AM, Dave Messina <David.Messina at sbc.su.se> wrote:
> >
> >> Hi,
> >>
> >> Looking pretty good, particularly the project plan section.
> >>
> >> You might also add some text in your introduction which shows the
> >> importance of RAxML. Say that it's widely used and demonstrate that with
> >> number of citations, number of downloads, or similar data.
> >>
> >> Also, there are some small English mistakes (for example wrap instead of
> >> wrapper, provide instead of provides), so ask a native English speaker to do
> >> some editing.
> >>
> >> Good luck! I'd love to see this happen.
> >>
> >> Dave
> >>
> >>
> >> On Apr 6, 2011, at 20:06, Fei Hu <hufeiyc at gmail.com> wrote:
> >>
> >>> Hi all,
> >>>
> >>> Below is my GSoC 2011 proposal, which describes my plan and thoughts.
> >>> As time is really tight now, I need your advice to make it more realistic
> >>> and reasonable.
> >>> I appreciate your time in reviewing it.
> >>> I am also looking for a mentor who is interested in this project and willing
> >>> to guide me through the summer.
> >>>
> >>> Best
> >>> Fei
> >>>
> >>> PS: Thanks Chris Fields for your valuable suggestion.
> >>>
> >>>
> >>> Name     Fei HU
> >>> Address  Rm. 3D-11, Swearingen Engineering Building, University of South
> >>> Carolina
> >>> Email      hufeiyc at gmail.com
> >>>
> >>> Why you are interested in the project you are proposing and are well-suited
> >>> to undertake it.
> >>> I like to use Perl to organize and automate pipelines, from extracting
> >>> data and running various packages to analyzing the results. I would like
> >>> more people to know its virtues and make use of it. BioPerl provides us a
> >>> perfect platform.
> >>> My current research is on gene order phylogeny reconstruction under the
> >>> maximum likelihood criterion (other approaches include MP- and NJ-based
> >>> methods). My phylogeny inference pipeline uses RAxML to build an ML tree
> >>> and PAML to estimate the internal (ancestral) sequences. While baseml of
> >>> PAML is well supported, RAxML is not yet available. Although I wrote my
> >>> own wrapper for RAxML, it would be even better for BioPerl to wrap RAxML
> >>> so that everyone can use it easily.
> >>> I have used RAxML extensively and also modified its source to adapt it to
> >>> the analysis of gene order data. With a good understanding of Perl and
> >>> RAxML and, what’s more, the willingness to make BioPerl better, I am
> >>> prepared to undertake it.
> >>> Programs or projects you have previously authored or contributed to
> >>> I implemented the algorithm using Perl [1] (open source). I also use and
> >>> learn Perl on a daily basis.
> >>> A project plan for the project you are proposing
> >>> The wrapper should be consistent in style and API with the other existing
> >>> packages supported by Bio::Tools::Run. I plan for it to cover the most
> >>> popular functionality RAxML currently provides.
> >>> 1. Binary sequence analysis (0-1, binary characters) and multi-state
> >>> sequence analysis (0-9A-V, 32 characters; available models are ORDERED,
> >>> MK, and GTR). This is useful for morphological data.
> >>> 2. DNA analysis and amino acid analysis, with a custom transition matrix
> >>> (AA only) and rate heterogeneity.
> >>> 4. Conduct standard bootstrapping and rapid bootstrapping, as well as the
> >>> final thorough inference [2] and the relatively new bootstopping.
> >>> 5. Accept a user starting tree or an incomplete constraint tree.
> >>> 6. Specify a column weight file name to assign individual weights to each
> >>> column of the alignment.
> >>> 7. Specify an exclude file name for a file that contains a specification
> >>> of the alignment positions you wish to exclude.
> >>> 8. Automatically generate a random seed for the program.
> >>> 9. And more to be added.
> >>> Other plans that may benefit users.
> >>> 1. Call Bio::SeqIO to parse and reconstruct interleaved or sequential
> >>> PHYLIP format so that RAxML can read it.
> >>> 2. Design a set of more understandable commands, such as
> >>> use “--model” instead of “-P” to specify a custom model file.
> >>> use “--workingdir” instead of “-w” to specify the working directory.
> >>> One can still use the old style according to personal preference.
> >>> 3. Implement a more sophisticated exception handler and running mode
> >>> summary.
> >>> A huge number of argument combinations can cause errors. For example, to
> >>> enable rapid bootstrapping plus a thorough inference, one needs to give
> >>> “-f a” and “-x {random seed}” together with the number of replicates “-#
> >>> {number}”. If any one of them is missing, RAxML won’t report at once that
> >>> all three are necessary; instead, it usually reports only the “nearest”
> >>> error it can spot. In my plan, if one wants to conduct rapid bootstrapping
> >>> (RBS) plus inference, the wrapper informs the user that all three are
> >>> necessary and then guides the user to correct the call. In sum, I plan to
> >>> dig the errors out of the source code and group them according to their
> >>> functionality, so that each error message will no longer be independent.
> >>> Another “trivial” thought: when the run ID already exists, RAxML exits
> >>> directly without offering a choice, which is annoying if overwriting is
> >>> fine. I suggest using a switch to define the behavior (overwrite, add a
> >>> postfix to the name, exit, or skip this run).
> >>> 4. Preliminary post-processing can be conducted and the results returned
> >>> as a value or list: output the maximum likelihood score for each
> >>> bootstrapped tree; enumerate branches that have a confidence value larger
> >>> than a threshold; return a hash table containing branch lengths, running
> >>> time, and the final ML score. Further analysis could be done by other
> >>> packages anyway.
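
A minimal sketch of items 2 and 3 above (friendlier option names and grouped
argument checks). It is written in Python only to illustrate the idea; the
actual wrapper would be Perl. The function and table names are hypothetical,
and the only RAxML flags used are the ones named in the proposal (-P, -w,
-f a, -x, -#). It assumes options arrive as a simple name-to-value mapping.

```python
# Long option names proposed in the message, mapped to the RAxML short flags
# the message pairs them with. A real wrapper would cover many more options.
LONG_TO_SHORT = {"--model": "-P", "--workingdir": "-w"}

# Per the message: "-f a" must be accompanied by "-x" (random seed) and
# "-#" (number of replicates).
RAPID_BOOTSTRAP_COMPANIONS = ("-x", "-#")


def normalize(options):
    """Return a copy of `options` with long names rewritten to short flags."""
    return {LONG_TO_SHORT.get(name, name): value for name, value in options.items()}


def missing_for_rapid_bootstrap(options):
    """List every companion flag that "-f a" needs but the caller left out."""
    opts = normalize(options)
    if opts.get("-f") != "a":
        return []  # rapid bootstrapping plus inference was not requested
    return [flag for flag in RAPID_BOOTSTRAP_COMPANIONS if flag not in opts]


def explain(options):
    """Build one message naming all missing flags, or return None if complete."""
    missing = missing_for_rapid_bootstrap(options)
    if not missing:
        return None
    return ('"-f a" also requires "-x {random seed}" and "-# {number}"; '
            "missing: " + ", ".join(missing))
```

Calling explain({"-f": "a", "-x": 12345}) returns a single message that names
"-#" as missing, so the user sees the whole requirement at once instead of
only the nearest error.
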
> >>>
> >>> Any obligations, vacations, or plans for the summer that may require
> >>> scheduling during the GSoC work period.
> >>> No special obligations or vacations.
> >>>
> >>>
> >>> [1] Hu, F., Gao, N. and Tang, J., "Maximum Likelihood Phylogenetic
> >>> Reconstruction Using Gene Order Encodings", CIBCB 2011, accepted.
> >>> [2] Stamatakis A, Hoover P, Rougemont J: A rapid bootstrap algorithm for the
> >>> RAxML web-servers. Syst. Biol. 2008, 57:758–771.
> >>>
> >>> _______________________________________________
> >>> Bioperl-l mailing list
> >>> Bioperl-l at lists.open-bio.org
> >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >>
> >
> >
> >
> > --
> > *Fei Hu
> > Bioinformatics Lab
> > 3D-11 Swearingen Building
> > U of South Carolina
> > Tel: 803-397-5240*
> >
>
>

-- 
*Fei Hu
Bioinformatics Lab
3D-11 Swearingen Building
U of South Carolina
Tel: 803-397-5240*