[Bioperl-l] [Bioperl-microarray] SOFT parsers

Thu Dec 28 21:57:03 UTC 2006

Michael Muratet US-Huntsville wrote:
> Sean
>
> Thanks. I did consider the bioconductor package and downloaded your
> write-up after it was recommended by the GEO folks. I've looked at R a
> few times, but I never got proficient at it. I'm a lot better with perl.
>
> I've been looking at MINiML, too. It looked like it might be easier to
> parse the SOFT file since the data is in-line with the attributes and
> I'd have to use a SAX parser (not enough memory for DOM) for MINiML.
>
> NCBI must have parsers to get the data into their databases. Do you know
> what they use?
>   
Michael,

You might want to look more closely at the MINiML format specs.
The data tables are stored as separate tab-delimited files, referenced
externally from the XML, so DOM parsing of the XML itself is possible
with just a few kB of memory.  Of course, reading all of the data
tables into memory at once will still take a large amount of memory for
some datasets.  If you are going to load into a database, I would
suggest reading the MINiML file using DOM and then stepping through the
data files one at a time, loading as you go.
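
A minimal sketch of the approach described above, written in Python
only for illustration (the same shape carries over to Perl with any DOM
module).  It assumes each sample is a "Sample" element with an "iid"
attribute and that the table file name is the text of an
"External-Data" element, so check those names against the MINiML specs
before relying on them.  The function names and the insert_row callback
are hypothetical placeholders for your own database code.

```python
import os
from xml.dom import minidom


def external_tables(miniml_path):
    """Yield (sample_id, table_path) for each externally stored table.

    Element and attribute names here are assumptions about the MINiML
    layout, not taken from the schema itself.
    """
    base_dir = os.path.dirname(os.path.abspath(miniml_path))
    # DOM is affordable here: the XML holds metadata, not the data rows.
    doc = minidom.parse(miniml_path)
    for sample in doc.getElementsByTagName("Sample"):
        sample_id = sample.getAttribute("iid")
        for ref in sample.getElementsByTagName("External-Data"):
            file_name = "".join(
                node.data
                for node in ref.childNodes
                if node.nodeType == node.TEXT_NODE
            ).strip()
            yield sample_id, os.path.join(base_dir, file_name)


def table_rows(table_path):
    """Yield one tab-delimited row at a time, never the whole file."""
    with open(table_path) as handle:
        for line in handle:
            line = line.rstrip("\r\n")
            if line:
                yield line.split("\t")


def load_all(miniml_path, insert_row):
    """Step through the data files one at a time, loading as we go.

    insert_row(sample_id, fields) is a hypothetical callback standing in
    for a database insert.  Returns the number of rows handed to it.
    """
    count = 0
    for sample_id, table_path in external_tables(miniml_path):
        for fields in table_rows(table_path):
            insert_row(sample_id, fields)
            count += 1
    return count
```

Only one row is held in memory at any moment, so the size of the
largest data table does not matter.  This sketch only follows tables
attached to samples; platform tables would need the same treatment.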

As for their parsers, I'm not sure what language they use, but writing a 
parser for either SOFT or MINiML isn't at all difficult.  GEO uses a 
very simplified MAGE schema. 
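
To give an idea of how little is involved on the SOFT side, here is a
rough line-by-line sketch, again in Python purely for illustration.  It
relies on the SOFT conventions that "^" lines start a new entity, "!"
lines carry attributes, "#" lines describe table columns, and the data
table sits between "!..._table_begin" and "!..._table_end" lines.  The
function name and the event tuples are my own invention, not part of
any existing library.

```python
# Map the leading character of a SOFT line to the kind of record it is.
KINDS = {"^": "entity", "!": "attribute", "#": "column"}


def parse_soft(lines):
    """Yield simple events from an iterable of SOFT-format lines.

    Events are ("entity" | "attribute" | "column", key, value) for the
    header lines and ("row", fields) for each line of a data table,
    including the table's own header row.
    """
    in_table = False
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        lowered = line.lower()
        if line.startswith("!") and lowered.endswith("_table_begin"):
            in_table = True
        elif line.startswith("!") and lowered.endswith("_table_end"):
            in_table = False
        elif in_table:
            # Data is in-line: tab-delimited rows, one per line.
            yield ("row", line.split("\t"))
        elif line[0] in KINDS:
            key, _, value = line[1:].partition("=")
            yield (KINDS[line[0]], key.strip(), value.strip())
```

Because it is a generator, you can pass it an open file handle and
process a large SOFT file without holding it in memory, for example
"for event in parse_soft(open(path)): ...".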

As for R vs. Perl, if you are planning on doing analyses of microarray
data, I would strongly suggest looking again at the R/Bioconductor
project.  It will save you from reinventing many wheels, such as
getting annotation like Gene Ontology and pathways, doing stats,
plotting, maintaining MIAME-compliant data structures, converting from
multiple microarray formats, etc.

Sean