[Biopython] Feature selection techniques modules

Peter Cock p.j.a.cock at googlemail.com
Sun Feb 6 22:05:49 UTC 2011


On Sun, Feb 6, 2011 at 8:37 PM, chris dimitrakopoulos
<dimitrakopoul at gmail.com> wrote:
> Hello everyone,
>
> I am an MSc student at the University of Patras, Greece, working in the
> field of bioinformatics. I recently became a member of the OBF and I
> appreciate the open source work of your OBF project.
>
> I had a discussion with Mr. Robert Buels about this year's GSoC, because I
> intend to apply and I found that the OBF would be the organization most
> suitable for me. I have been browsing the projects announced in previous
> years and found them very interesting. As this year's potential projects
> have not been announced yet, I wanted to briefly describe an idea of mine
> and ask whether you think it is a good one and worth submitting as an
> application after March 28.
>
> Feature selection (FS) techniques have become very important in many
> bioinformatics applications. In many cases (such as protein interaction
> prediction), you have to find the set of features that leads to the best
> classification performance. I looked in the Biopython libraries and did
> not find anything for applying FS techniques to a dataset of features
> (like t-test, ANOVA, Wilcoxon, CFS, etc.). Hence, I think that creating a
> library focused on FS techniques would be a good idea. Moreover, that
> library could have a hierarchical structure, as there are different types
> of FS techniques: filter, wrapper and embedded. Furthermore, each type is
> divided into more groups (e.g. filter methods are divided into univariate
> and multivariate methods, according to whether they consider feature
> dependencies).
>
> Some of the methods I am thinking of implementing are:
>
> T-test, ANOVA, Gamma, bivariate methods, CFS and MRMR, which are some
> well-known filter feature selection techniques.
> In wrapper and embedded methods, classifiers are used in the
> process of feature selection, so we have techniques based on genetic
> algorithms, random forests, logistic regression, decision tree learners,
> Bayesian classifiers, etc. In this case, the existing Biopython modules
> Bio.LogisticRegression, Bio.GA and Bio.NaiveBayes could be used.
>
> More information on the techniques I describe can be found at the
> following links:
>
> http://bioinformatics.oxfordjournals.org/content/23/19/2507.full.pdf+html
> http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=3570EDE4C7E11AAE7CA5F727800DC58A?doi=10.1.1.37.4643&rep=rep1&type=pdf
>
> New functions computing the above measures can be created. The calculation
> can be done between vectors of features, between a feature vector and the
> output vector, or over a large dataset (with many features) read from a
> file, on which we want to perform feature selection.
>
> I am sending you this email to briefly describe my idea. Please let me
> know what you think about it and whether it is worth proposing as one of
> my student applications to the Open Bioinformatics Foundation for GSoC
> 2011. If you would like any further details about my thinking, just ask!
> :-)
>
> Looking forward to hearing from you,
> Chris Dim

Hello Chris,

This sounds interesting - provided we can find some suitable mentors
it could turn into a Google Summer of Code project. Something you
could start with (now, or as one of the first tasks if you write up a GSoC
proposal) would be to understand the existing code in Biopython in
this area (Bio.LogisticRegression, Bio.GA, Bio.NaiveBayes etc) and
perhaps write extra documentation for it (these modules are not covered
in the tutorial at all), and perhaps some more unit tests too.

One thing I would suggest checking is how much of the statistical code
you mention is already written in other Python libraries (e.g. SciPy).
For something as complicated as statistical testing there is no point
in reimplementing it. Tiago has previously said there are statistics
routines in SciPy he may want to use in his Biopython code for population
genetics. So, check out SciPy: http://scipy.org/
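
To give a rough idea of what reusing SciPy might look like, here is a
minimal sketch of a univariate filter step built on scipy.stats.ttest_ind.
The function name rank_features_by_ttest, the binary 0/1 labels and the
synthetic data are all assumptions made for illustration only; nothing
like this exists in Biopython.

```python
import numpy as np
from scipy import stats


def rank_features_by_ttest(X, y):
    """Rank the columns of X by a two-sample t-test between classes 0 and 1.

    X is an (n_samples, n_features) array and y holds binary labels (0 or 1).
    Returns the feature indices sorted by ascending p-value, and the p-values.
    This helper is hypothetical, written only for this example.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # equal_var=False selects Welch's t-test (no equal-variance assumption).
    _, p_values = stats.ttest_ind(X[y == 0], X[y == 1], axis=0, equal_var=False)
    order = np.argsort(p_values)
    return order, p_values


# Synthetic data: 5 features, only feature 2 differs between the classes.
rng = np.random.default_rng(0)
n = 40
y = np.array([0] * n + [1] * n)
X = rng.normal(size=(2 * n, 5))
X[y == 1, 2] += 3.0

order, p_values = rank_features_by_ttest(X, y)
print("best feature:", order[0])
```

The same pattern should carry over to other tests already in scipy.stats,
such as f_oneway (one-way ANOVA) or ranksums (Wilcoxon rank-sum), so a
feature selection module could wrap these rather than reimplement them.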

Regards,

Peter


