[Bioperl-l] [Bioperl-microarray] SOFT parsers
Sean Davis
sdavis2 at mail.nih.gov
Thu Dec 28 21:57:03 UTC 2006
Michael Muratet US-Huntsville wrote:
> Sean
>
> Thanks. I did consider the bioconductor package and downloaded your
> write-up after it was recommended by the GEO folks. I've looked at R a
> few times, but I never got proficient at it. I'm a lot better with perl.
>
> I've been looking at MINiML, too. It looked like it might be easier to
> parse the SOFT file since the data is in-line with the attributes and
> I'd have to use a SAX parser (not enough memory for DOM) for MINiML.
>
> NCBI must have parsers to get the data into their databases. Do you know
> what they use?
>
Michael,
You might want to look more closely at the MINiML format specs.
The data tables are stored as separate tab-delimited files that the XML
refers to externally, so the XML itself stays small and DOM parsing is
possible with just a few kB of memory. Of course, reading all of the
data tables into memory at once will take a large amount of memory for
some datasets. If you are going to load into a database, I would suggest
reading the MINiML using DOM and then stepping through the data files
one at a time, loading as you go.
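To make that concrete, here is a rough sketch in Python of the "DOM for the XML, then one table at a time" approach. It is only an illustration under assumptions: the element and attribute names (Sample, iid, External-Data) are my reading of the MINiML schema and should be checked against the spec, and store_row is a hypothetical callback standing in for whatever database insert you use.

```python
import csv
import os
from xml.dom import minidom


def external_tables(miniml_path):
    """Return (sample_id, table_path) pairs found in a MINiML-style file.

    Assumes each Sample element carries an "iid" attribute and that its
    External-Data element contains the name of a tab-delimited file that
    sits next to the XML file.
    """
    doc = minidom.parse(miniml_path)  # small: the data tables are not inline
    base_dir = os.path.dirname(os.path.abspath(miniml_path))
    pairs = []
    for sample in doc.getElementsByTagName("Sample"):
        sample_id = sample.getAttribute("iid")
        for ref in sample.getElementsByTagName("External-Data"):
            # The file name is the text content of the element.
            file_name = "".join(
                node.data for node in ref.childNodes
                if node.nodeType == node.TEXT_NODE
            ).strip()
            pairs.append((sample_id, os.path.join(base_dir, file_name)))
    return pairs


def iter_rows(table_path):
    """Yield one row (a list of strings) at a time from a tab-delimited file."""
    with open(table_path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            yield row


def load_all(miniml_path, store_row):
    """Step through every external table, handing each row to store_row.

    store_row(sample_id, row) is a hypothetical callback, e.g. a wrapper
    around a database INSERT. Returns the number of rows handed over.
    """
    count = 0
    for sample_id, table_path in external_tables(miniml_path):
        for row in iter_rows(table_path):
            store_row(sample_id, row)
            count += 1
    return count
```

Only one row is held in memory at any moment, so the size of the largest data table does not matter much. Calling load_all("family.xml", my_insert) with your own insert function would walk every table in turn (both names here are made up).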
As for their parsers, I'm not sure what language they use, but writing a
parser for either SOFT or MINiML isn't at all difficult. GEO uses a
very simplified MAGE schema.
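To give a feel for how little is involved on the SOFT side, here is a minimal sketch in Python. It relies only on the SOFT line conventions ("^" starts an entity, "!" is an attribute, "#" describes a column, and table rows sit between the *_table_begin and *_table_end markers). The function name and the dictionary layout are made up for illustration, and it does not attempt to cover every corner of the format.

```python
def parse_soft(lines):
    """Parse an iterable of SOFT lines into a list of entity dictionaries.

    Each entity has: type, id, attributes (name -> list of values),
    columns (name -> description), header (list) and rows (list of lists).
    """
    entities = []
    current = None
    in_table = False
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        if line.startswith("^"):
            # Entity indicator, e.g. "^SAMPLE = GSM0001"
            kind, _, name = line[1:].partition("=")
            current = {
                "type": kind.strip(),
                "id": name.strip(),
                "attributes": {},
                "columns": {},
                "header": None,
                "rows": [],
            }
            entities.append(current)
            in_table = False
        elif current is None:
            continue  # ignore anything before the first entity line
        elif line.startswith("!"):
            key, _, value = line[1:].partition("=")
            key = key.strip()
            if key.lower().endswith("_table_begin"):
                in_table = True
            elif key.lower().endswith("_table_end"):
                in_table = False
            else:
                # Attributes may repeat, so keep every value in a list.
                current["attributes"].setdefault(key, []).append(value.strip())
        elif line.startswith("#") and not in_table:
            # Column description, e.g. "#VALUE = normalized signal"
            name, _, description = line[1:].partition("=")
            current["columns"][name.strip()] = description.strip()
        elif in_table:
            fields = line.split("\t")
            if current["header"] is None:
                current["header"] = fields  # first table line is the header
            else:
                current["rows"].append(fields)
    return entities
```

This version keeps the table rows in memory for simplicity. For large files you would hand each row to your database loader as it is read rather than appending it to a list.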
As for R vs. perl, if you are planning on doing analyses of microarray
data, I would strongly suggest looking again at the R/bioconductor
project. It will save you from reinventing many wheels, such as getting
annotation like gene ontology and pathways, doing stats, plotting,
maintaining MIAME-compliant data structures, converting from multiple
microarray formats, etc.
Sean