[Bioperl-l] Microarray ANOVA module

Tue, 13 Aug 2002 08:14:37 -0400

Peter,

Wasn't Nathan Siemers of BMS talking about a package for microarray data
some time ago? I'll be corrected if I'm wrong. Anyway, Yujin Hoshida
contributed a package of scripts a while back; the directory is
scripts/contributed/expression_analysis. Here is the README:

Contributed by Yujin Hoshida <d35116@h.u-tokyo.ac.jp>
-----------------------------------------------------------

I am sending my Perl scripts that handle DNA microarray data. All scripts read
data from a comma-delimited text file. They fall into 3 groups, listed below.
I am sorry for the delay. Preparation took time owing to my little
daughter's heavy crying at night (she is 3 months old).

(1) discriminative gene selection.

       t_test.pl     :    permutation t-test (Radmacher, NCI report)
       u_test.pl     :    Mann-Whitney U-test
       info.pl       :    Info-score (TNoM) (Ben-Dor, JCB)
       ds.pl         :    discrimination score (Golub, Science)
       cat.pl        :    categorization (e.g. 3 groups: Cy3/Cy5 >= 2,
                          0.5 < Cy3/Cy5 < 2, Cy3/Cy5 <= 0.5)
                          (Tsunoda, Cancer Res)

These scripts select genes that discriminate 2 groups (4 samples in each
group, the minimum number, I think) based on 10,000 random permutations
of the sample labels. The threshold is set to P=.001 (i.e. beyond the top
or bottom 10 permutations). The scripts differ only in their gene
selection algorithms.

They need rewriting to match the number of samples in the microarray
data to be analyzed.
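
As an illustration of the permutation approach described above, here is a
minimal sketch for a single gene. It is not taken from t_test.pl; the sub
names t_statistic and permutation_p are hypothetical, and the choice of a
two-sided, unequal-variance t statistic is an assumption.

```perl
use strict;
use warnings;
use List::Util qw(sum shuffle);

# t statistic (unequal-variance form) for two groups of expression values.
# Assumes each group holds at least 2 values.
sub t_statistic {
    my ($x, $y) = @_;
    my $mean_x = sum(@$x) / scalar(@$x);
    my $mean_y = sum(@$y) / scalar(@$y);
    my $var_x  = sum(map { ($_ - $mean_x) ** 2 } @$x) / (scalar(@$x) - 1);
    my $var_y  = sum(map { ($_ - $mean_y) ** 2 } @$y) / (scalar(@$y) - 1);
    my $se     = sqrt($var_x / scalar(@$x) + $var_y / scalar(@$y));
    return $se > 0 ? ($mean_x - $mean_y) / $se : 0;
}

# Two-sided permutation P value for one gene: the fraction of random
# relabelings whose |t| is at least as large as the observed |t|.
sub permutation_p {
    my ($x, $y, $n_perm) = @_;
    my $observed = abs(t_statistic($x, $y));
    my @pooled   = (@$x, @$y);
    my $n_x      = scalar(@$x);
    my $extreme  = 0;
    for (1 .. $n_perm) {
        # Shuffling the pooled values is equivalent to shuffling the labels.
        my @p  = shuffle(@pooled);
        my @px = @p[0 .. $n_x - 1];
        my @py = @p[$n_x .. $#p];
        $extreme++ if abs(t_statistic(\@px, \@py)) >= $observed - 1e-12;
    }
    return $extreme / $n_perm;
}
```

With 10,000 permutations, a gene would pass the P=.001 threshold when at
most 10 of the relabelings are as extreme as the observed statistic.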

(2) leave-one-out cross validation of (1).

        script               gene selection          in-silico genotyping
        loocv_t.pl       :   t-test                  compound covariate (Radmacher)
        loocv_u.pl       :   U-test                  simple rank
        loocv_t_vote.pl  :   t-test                  weighted vote (Golub)
        loocv_u_vote.pl  :   U-test                  weighted vote
        loocv_info.pl    :   Info-score (TNoM)       weighted vote
        loocv_ds.pl      :   discrimination score    weighted vote
        loocv_cat.pl     :   categorization          weighted vote

These scripts evaluate (1). One sample is removed, and discriminative genes
are selected from the remaining samples using each algorithm. The removed
sample is genotyped using the selected gene set, and the genotyping is judged
correct or incorrect. This process is repeated for all samples and the number
of misclassifications is counted. Furthermore, the sample labels are randomly
permuted 1,000 times and the significance of the result (P=.05) is evaluated.
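
The leave-one-out procedure described above could be sketched roughly as
follows. This is not the code from the loocv_*.pl scripts. It assumes class
labels 1 and 2, a signal-to-noise score (mean1 - mean2) / (sd1 + sd2) for
gene selection, and a weighted vote for genotyping; the sub names and the
top_k parameter are hypothetical.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Mean and sample standard deviation of a non-empty list of numbers.
sub mean_sd {
    my @x    = @_;
    my $mean = sum(@x) / scalar(@x);
    my $var  = @x > 1 ? sum(map { ($_ - $mean) ** 2 } @x) / (scalar(@x) - 1) : 0;
    return ($mean, sqrt($var));
}

# Leave-one-out cross validation with weighted voting.
# $data   : array ref of samples, each an array ref of expression values
# $labels : array ref of class labels (1 or 2), one per sample
# $top_k  : number of genes selected in each fold (illustrative parameter)
# Assumes every fold keeps at least 2 training samples per class.
# Returns the number of misclassified samples.
sub loocv_weighted_vote {
    my ($data, $labels, $top_k) = @_;
    my $n_genes = scalar @{ $data->[0] };
    my $errors  = 0;

    for my $left_out (0 .. $#$data) {
        my @train = grep { $_ != $left_out } 0 .. $#$data;

        # Score every gene using the training samples only.
        my @genes;
        for my $g (0 .. $n_genes - 1) {
            my @c1 = map { $data->[$_][$g] } grep { $labels->[$_] == 1 } @train;
            my @c2 = map { $data->[$_][$g] } grep { $labels->[$_] == 2 } @train;
            my ($m1, $s1) = mean_sd(@c1);
            my ($m2, $s2) = mean_sd(@c2);
            my $denom = $s1 + $s2;
            next if $denom == 0;    # no spread in either class: skip the gene
            push @genes, {
                index    => $g,
                score    => ($m1 - $m2) / $denom,
                boundary => ($m1 + $m2) / 2,
            };
        }

        # Keep the $top_k genes with the largest absolute score.
        my @selected = sort { abs($b->{score}) <=> abs($a->{score}) } @genes;
        splice(@selected, $top_k) if @selected > $top_k;

        # Each selected gene votes; a positive total predicts class 1.
        my $vote = 0;
        for my $gene (@selected) {
            my $x = $data->[$left_out][ $gene->{index} ];
            $vote += $gene->{score} * ($x - $gene->{boundary});
        }
        my $predicted = $vote > 0 ? 1 : 2;
        $errors++ if $predicted != $labels->[$left_out];
    }
    return $errors;
}
```

Note that gene selection happens inside the loop, so the removed sample never
influences which genes are used to genotype it.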

Info-score and TNoM are not calculated in these scripts (they were calculated
manually beforehand).
The problem with these scripts is their huge calculation time.
Dr. Jason, would this problem be solved if I used a compiled language
instead of Perl (an interpreted language)?

(3) relevance network of gene expression (Butte, PNAS)

        entropy.pl     :    select genes with sufficient entropy for
                            calculation of the correlation coefficient.
        relevance.pl   :    calculate the Pearson correlation coefficient
                            among genes, and select genes with a correlation
                            coefficient higher than the threshold value.
        relrand.pl     :    calculate the threshold value of the Pearson
                            correlation coefficient.
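
For illustration, the pairwise correlation step could look something like the
sketch below. It is not the code from relevance.pl; the sub names pearson and
relevant_pairs are hypothetical, and comparing the absolute value of r against
the threshold is an assumption.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Pearson correlation coefficient of two equal-length expression profiles.
sub pearson {
    my ($x, $y) = @_;
    my $n      = scalar(@$x);
    my $mean_x = sum(@$x) / $n;
    my $mean_y = sum(@$y) / $n;
    my ($sxy, $sxx, $syy) = (0, 0, 0);
    for my $i (0 .. $n - 1) {
        my ($dx, $dy) = ($x->[$i] - $mean_x, $y->[$i] - $mean_y);
        $sxy += $dx * $dy;
        $sxx += $dx * $dx;
        $syy += $dy * $dy;
    }
    # A constant profile has no defined correlation; treated as 0 here.
    return 0 if $sxx == 0 || $syy == 0;
    return $sxy / sqrt($sxx * $syy);
}

# Return [i, j, r] for every gene pair with |r| >= $threshold.
# $genes : array ref of expression profiles (array refs), one per gene
sub relevant_pairs {
    my ($genes, $threshold) = @_;
    my @pairs;
    for my $i (0 .. $#$genes - 1) {
        for my $j ($i + 1 .. $#$genes) {
            my $r = pearson($genes->[$i], $genes->[$j]);
            push @pairs, [$i, $j, $r] if abs($r) >= $threshold;
        }
    }
    return \@pairs;
}
```

Because each pair is compared with the threshold as soon as it is computed,
this sketch keeps only the qualifying pairs and never sorts the full set of
coefficients.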

Now I am developing a Java application that visualizes the relevance network
based on the data sheet produced by the relevance.pl script.
relevance.pl also takes a huge amount of calculation time (probably in the
step that sorts the correlation coefficients: e.g., for a DNA array with
5,000 genes, 12,497,500 correlation coefficients are calculated). I think
that some improvement is needed (e.g. using bubble sort).

My coding is not elegant. Please tell me which points need revision.
In addition, I apologize for my poor English explanation.

Kind regards,

Yujin

Brian O.

-----Original Message-----
From: bioperl-l-admin@bioperl.org [mailto:bioperl-l-admin@bioperl.org] On
Behalf Of Robinson, Peter
Sent: Tuesday, August 13, 2002 4:19 AM
To: bioperl-l@bioperl.org
Subject: [Bioperl-l] Microarray ANOVA module

Hi All,

I would like to ask about the status of Bioperl's plans for modules for
microarray data analysis. I have written a module that does ANOVA (F test)
analysis of microarray data and would like to make it accessible somewhere
but am not sure about the proper place and would like to ask if someone on
this list would be willing to take a look at it.

The module can be used to analyze groups of repeats of experiments (such as
a time course), taking the Stanford .pcl format as input and outputting in
the GeneCluster format. Only genes that pass the significance test are
output. I am planning on extending the module to include other functions
such as T test or fold change filters as well as XML/format interconversion.

The module depends on Statistics::Distributions and uses some of the
statistics functions from the Perl algorithms book.

To use it:

use ArrayANOVA;    # (Not a nice name...)

my $anova = ArrayANOVA->new(
        filename => "inputfile.txt",
        outputfilename => "outputfile.txt",
        significance_level => "0.01",
        groups => [
                [4,5,6,7],
                [8,9,10],
                [11,12,13,14],
                [15,16,17],
                [18,19,20],
                [21,22,23],
                [24,25,26,27]
                ],
        replace_missing => "1"
        );

$anova->filter_significant_genes();
$anova->outputGeneClusterFormat();
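
For readers unfamiliar with the test, the one-way ANOVA F statistic for a
single gene can be sketched as below. This is not the ArrayANOVA
implementation; the sub name f_statistic is hypothetical, and it takes the
measured values of each group directly rather than column indices.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# One-way ANOVA F statistic for one gene.
# $groups : array ref of array refs, one per group of repeat measurements
# Assumes at least 2 groups and some variation within the groups.
# Returns ($f, $df_between, $df_within).
sub f_statistic {
    my ($groups) = @_;
    my @all        = map { @$_ } @$groups;
    my $grand_mean = sum(@all) / scalar(@all);
    my ($ss_between, $ss_within) = (0, 0);
    for my $g (@$groups) {
        my $mean = sum(@$g) / scalar(@$g);
        # Variation of the group mean around the grand mean.
        $ss_between += scalar(@$g) * ($mean - $grand_mean) ** 2;
        # Variation of the repeats around their own group mean.
        $ss_within  += sum(map { ($_ - $mean) ** 2 } @$g);
    }
    my $df_between = scalar(@$groups) - 1;
    my $df_within  = scalar(@all) - scalar(@$groups);
    my $f = ($ss_between / $df_between) / ($ss_within / $df_within);
    return ($f, $df_between, $df_within);
}
```

The F value and the two degrees of freedom would then be converted to a
P value with an F distribution function and compared with the chosen
significance level.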

best,

Peter

Dr. med. Peter Robinson
Institut für Medizinische Genetik
Universitätsklinikum Charité
Augustenburger Platz 1
13353 Berlin
Germany
