[Bioperl-l] GEO SOFT Parser?

Allen Day allenday at ucla.edu
Mon May 31 00:39:02 EDT 2004


Hi,

I don't think it would be difficult to roll some pieces of this into a
MicroarrayIO format handler.  The difficult bits will be the
institution/submitter/experiment information.  I intentionally left out
support for these sorts of details in the Bioperl modules and focused
instead on IO for the raw data itself.  IMO you're better off looking to a
standard like MAGE-ML (as opposed to SOFT) if you want to work with data
other than expression levels.

Regarding SOFT in general, why do you want to process this data type?
Last I looked it was very poor at representing how the expression
measurements were quantitated from the experiment and transformed.
Furthermore, the details of how the data were transformed vary from
experiment to experiment in GEO.  IMO you're better off starting from
Affymetrix CEL files (or their platform-specific equivalent) and doing
the processing yourself.

You also might want to have a look at Bioconductor
(http://www.bioconductor.org); there may be SOFT support already, but I'm
not sure.

-Allen


On Mon, 31 May 2004, Gong Wuming wrote:

> Hi Tex,
> I asked the same question here a few days ago but got no response, which 
> was a bit surprising because I thought it would be a relatively common 
> problem. At first I planned to write a module for parsing the SOFT format 
> in Bio::Expression::MicroarrayIO::, but I found that difficult because 
> many important base classes in Bioperl-Microarray are not implemented 
> yet, especially for expression data features. So I wrote a simple Perl 
> script that reads the information in a SOFT file into a data structure. 
> The code is below.
> 
> -----------------------------------
> #! /usr/bin/perl
> use strict;
> use warnings;
> my $hash = {};  # attributes of the entity currently being read
> my $DATA = {};  # parsed result, keyed by domain
> my ($last_domain, $this_domain, $last_mark, $this_mark);
> 
> # Reading file line by line.
> while (<>){
>   chomp;
> 
>   $this_mark = substr($_, 0, 1); # Get line marker: '^', '!' or '#'
> 
>   if ($this_mark =~ /\^|\!/){ # If the line is headed by '^' or '!'.
>     my @attr;
> 
>     # Extract the key-value pair ("key = value"). The limit of 2 keeps
>     # any " = " inside the value intact.
>     my ($key, $value) = split (/\s+=\s+/, substr($_, 1), 2);
>     ($this_domain, @attr) = split ("_", $key);
>     my $attribute = join ('_', @attr) || 'id';
> 
>     if ($this_mark eq '^' and $last_domain) {
>       my %attribute = %$hash;
>       push (@{$DATA->{$last_domain}}, \%attribute);
>       $hash = {};
>     }
>     $hash->{$attribute} = $value;
>   }elsif ($this_mark eq '#'){
>     my ($field, $desc) = /^#(.+?)\s+=\s+(.+)$/;
>     my ($description, $src) = (split (/;*\s+.+?:\s+/, $desc))[1, 2];
>     push (@{$DATA->{'data'}}, {'field'=>$field,
>       'description'=>$description, 'src'=>$src, 'value'=>[]});
>   }else{ # Data field.
>     next if /^ID_REF/;
>     my $i = 0;
>     # Append each tab-separated cell to its column's value list.
>     push (@{$DATA->{'data'}->[$i++]->{'value'}}, $_) for split (/\t/);
>   }
>   $last_domain = $this_domain;
>   $last_mark = $this_mark;
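> }
> # Store the last entity; no later '^' line remains to flush it.
> push (@{$DATA->{$last_domain}}, $hash) if $last_domain and %$hash;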
> -------------------------------------------------------------
> The results are stored in a data structure like this:
> 
> $DATA{
>   'database'=>{
>     'name'=>
>     'institute'=>
>     'web_link'=>
>     'email'=>
>     'ref'=>
>   }
>   'dataset'=>{
>     'id'=>
>     'completeness'=>
>     'description'=>
>     'experiment_type'=>
>     'maximum_probes'=>
>     'order'=>
>     'organism'=>
>     'platform'=>
>     'reference_series'=>
>     'title'=>
>     'total_samples'=>
>     'update_date'=>
>     'value_type'=>
>   }
>   'subset'=>[
>     {
>       'id'=>
>       'description'=>
>       'type'=>
>       'sample'=>[]
>     }
>   ]
>   'data'=>[
>     {
>       field => 
>       description=>
>       src=>
>       value=>[]
>     }
>   ]
> }
> Wuming Gong
> 
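For anyone who wants to try the approach on a small input, here is a minimal sketch of the same marker-based dispatch wrapped in a sub. It is not part of Bioperl. The name parse_soft and the sample text are made up for illustration, and it departs from the script above in a few places: keys are lowercased, '!' lines with no "=" are skipped, the 'src' parsing is dropped, and the last entity is stored after the loop.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not a Bioperl API): parse SOFT-style text from a
# filehandle into a hash of domains, each holding a list of records.
sub parse_soft {
    my ($fh) = @_;
    my (%data, %record, $domain);

    while (my $line = <$fh>) {
        chomp $line;
        next unless length $line;
        my $mark = substr($line, 0, 1);

        if ($mark eq '^' or $mark eq '!') {
            # "key = value"; the limit of 2 keeps any later "=" in the value.
            my ($key, $value) = split /\s*=\s*/, substr($line, 1), 2;
            next unless defined $key and defined $value;
            my ($this_domain, @attr) = split /_/, lc $key;

            # A '^' line opens a new entity, so store the one before it.
            if ($mark eq '^' and defined $domain) {
                push @{ $data{$domain} }, {%record};
                %record = ();
            }
            $domain = $this_domain;
            $record{ join('_', @attr) || 'id' } = $value;
        }
        elsif ($mark eq '#') {
            # Column descriptor: "#NAME = description".
            my ($field, $desc) = $line =~ /^#(.+?)\s*=\s*(.*)$/ or next;
            push @{ $data{data} },
                { field => $field, description => $desc, value => [] };
        }
        elsif ($line !~ /^ID_REF\b/) {
            # Table row: the i-th tab-separated cell goes to the i-th column.
            my @cells = split /\t/, $line;
            push @{ $data{data}[$_]{value} }, $cells[$_] for 0 .. $#cells;
        }
    }
    # Store the last entity, which no later '^' line would flush.
    push @{ $data{$domain} }, {%record} if defined $domain;
    return \%data;
}

# Made-up SOFT fragment for illustration only; this is not real GEO data.
my $soft = join "\n",
    '^DATASET = GDS_EXAMPLE',
    '!dataset_title = Example title',
    '!dataset_value_type = count',
    '^SUBSET = SUBSET_EXAMPLE',
    '!subset_type = strain',
    '#ID_REF = Platform reference identifier',
    '#SAMPLE_A = Value for SAMPLE_A',
    "ID_REF\tSAMPLE_A",
    "probe_1\t10.5",
    "probe_2\t7.25",
    '';

open my $fh, '<', \$soft or die "cannot open in-memory file: $!";
my $data = parse_soft($fh);
close $fh;

print "$data->{dataset}[0]{title}\n";
print "$_->{field}: @{ $_->{value} }\n" for @{ $data->{data} };
```

With the sample above, every domain ends up as a list of records, so the dataset title is reached as $data->{dataset}[0]{title}, and each entry of $data->{data} is one table column with its values in row order.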

