[Biopython] quantile normalization method

Sat Mar 20 19:30:45 UTC 2010

On 3/20/10 7:26 PM, Vincent Davis wrote:
>     @Laurent Gautier
>
>     The algorithm is fairly straightforward, as you noted it, but beware
>     of details such missing values, ability to normalize against a
>     target distribution, or ties when ranking (although I'd have to
>     check if those receive a special treatment).The quantile
>     normalization code in the R package "preprocessCore" is in C and
>     might outperform a pure Python implementation.
>
>
> Not sure about speed. I have 84 microarrays samples with ~190,000 probes
> and it normalizes in 7 sec. I have no idea how fast R is or how many
> arrays are common to normalize.

So speed is not an issue for your use-case; even a 10x speedup might not 
justify the effort required to move to C, as this operation is performed 
once in a while (once per dataset mostly).

I am not sure there is a "common" number. When still working with 
arrays, I can find myself with several hundred arrays with ~2 million 
probes each.

>     There is a variety of normalization methods in bioconductor, and it
>     might make sense to embrace it as a dependency (rather than
>     reimplement it). I have bindings for Bioconductor up my sleeve about
>     to be distributed to few people for testing. The public release
>     might be around ISMB, BOSC time.
>
>
> I considered this and in the long run you might be right. But I don't
> know R and I placed more value on understanding the normalization than
> learning R. This is in part because there is little advantage in using R
> in the next steps of my analysis.

Surprising, but you'll know best.

> Bindings seem like a good idea but
> they would be a black box to me. I guess for me since most of this is
> new the value of implementing my own normalization in both learning more
> about python and understanding the normalization out ways the benefits
> of implementing it in R.

Everyone's mileage will vary. I often like building on existing 
libraries (although I frequently read how methods work): this makes my 
palette of tools richer than if I had to reimplement everything, and 
gives me time to create my own.
Having this said, learning a language by implementing is a great way to go.

> As a side question, why use biopython, are there ways in which it is
> better than R ?

In short (and therefore with some imprecision and/or distortion), 
Biopython is a "Python package" (i.e., collection of modules) for 
bioinformatics, with a forte in handling a number of bioinformatics file 
formats. R is a language for statistics, data analysis and graphics.

> For me it is purely that I know python (a little) and can nothing about
> R. Sure If I am just doing through step by step instruction from
> a bioconductor use manual I am fine but once I what to do something new
> am am lost. Not that I can't learn I am just prioritizing my learning.

Then the idea is that you consider R/bioconductor as a Python library. 
Should you want something new, you can then implement it in Python.

Laurent

>
> And thanks for this
>
>     norm_a = numpy.array(normq(m))
>
>     can be replaced by
>
>     norm_a = numpy.as_array(normq(m))
>
>     to improve performances whenever m is of substantial size (as no
>     copy is made - see
>     http://rpy.sourceforge.net/rpy2/doc-2.1/html/numpy.html#from-rpy2-to-numpy)
>
>
>
> 		
>
> *Vincent Davis
> 720-301-3003 *
> vincent at vincentdavis.net <mailto:vincent at vincentdavis.net>
>
> my blog <http://vincentdavis.net> | LinkedIn
> <http://www.linkedin.com/in/vincentdavis>
>
>
>
> On Sat, Mar 20, 2010 at 12:05 PM, Laurent Gautier <lgautier at gmail.com
> <mailto:lgautier at gmail.com>> wrote:
>
>     Hi Bartek and Vincent,
>
>     Few comments:
>
>     A/
>
>     The algorithm is fairly straightforward, as you noted it, but beware
>     of details such missing values, ability to normalize against a
>     target distribution, or ties when ranking (although I'd have to
>     check if those receive a special treatment).
>     The quantile normalization code in the R package "preprocessCore" is
>     in C and might outperform a pure Python implementation.
>
>     B/
>
>     There is a variety of normalization methods in bioconductor, and it
>     might make sense to embrace it as a dependency (rather than
>     reimplement it). I have bindings for Bioconductor up my sleeve about
>     to be distributed to few people for testing. The public release
>     might be around ISMB, BOSC time.
>
>     C/
>
>
>     norm_a = numpy.array(normq(m))
>
>     can be replaced by
>
>     norm_a = numpy.as_array(normq(m))
>
>     to improve performances whenever m is of substantial size (as no
>     copy is made - see
>     http://rpy.sourceforge.net/rpy2/doc-2.1/html/numpy.html#from-rpy2-to-numpy
>     )
>
>
>
>     Best,
>
>
>     Laurent
>
>
>
>
>     On 3/20/10 5:00 PM, biopython-request at lists.open-bio.org
>     <mailto:biopython-request at lists.open-bio.org> wrote:
>
>              >  Is there a quantile normalization method in biopython, I
>             search but did not
>              >  find. If not it looks straight forward would it be of
>             any interest to the
>              >  community for me to contribute a method
>              >
>              >  1. given n arrays of length p, form X of dimension
>              >  p ? n where each array is a column;
>              >  2. sort each column of X to give X sort ;
>              >  3. take the means across rows of X sort and assign this
>              >  mean to each element in the row to get X sort ;
>              >  4. get X normalized by rearranging each column of
>              >  X sort to have the same ordering as original X
>              >
>              >  From
>              >  A comparison of normalization methods for high
>              >  density oligonucleotide array data based on
>              >  variance and bias
>              >  B. M. Bolstad 1,?, R. A. Irizarry 2, M. Astrand 3 and T.
>             P. Speed 4, 5
>              >  ?
>              >
>
>           Hi,
>
>         I don't think there is such a method available.
>
>         I'm myself using the original R implementation by Bolstad et al.
>         It requires
>         rPy and R installed. It can be achieved in a few lines of code:
>
>         <pre>
>         import rpy2.robjects as robjects
>         #ll = list of concatenated values to normalize
>         v = robjects.FloatVector(ll)
>         #numrows=number of vectors that made up ll
>         m = robjects.r['matrix'](v, nrow = numrows, byrow=True)
>         robjects.r('require("preprocessCore")')
>         normq=robjects.r('normalize.quantiles')
>         norm_a=numpy.array(normq(m))
>         #norm_a=normalized array
>         </pre>
>
>         If your method is a pure python implementation which is
>         comparably fast I
>         think it would be worth to have it in Biopython since the method
>         is (in my
>         opinion) quite useful and it would remove the dependency on R
>         from some of
>         my scripts.
>
>         cheers
>           Bartek
>
>
>