[Biopython-dev] GEO SOFT parser

Peter Cock p.j.a.cock at googlemail.com
Sun May 15 14:40:24 UTC 2011


On Sun, May 15, 2011 at 2:13 AM, Phillip Garland wrote:
> Hello,
>
> I've created a new parser for GEO SOFT files - a fairly simple
> line-orientated format used by NCBI's Gene Expression Omnibus for
> holding gene expression data, information about the experimental
> platform used to generate the data, and associated metadata. At the
> moment it parses platform (GPL), series (GSE), sample (GSM), and
> dataset (GDS) files into objects, with access to the metadata, and
> data table entries.
>
> It's accessible through my github biopython repo:
> https://github.com/pgarland/biopython
> git://github.com/pgarland/biopython.git
>
> Branch:
> new-geo-soft-parser
>
> All the changed files are in the Bio/Geo directory.
>
> The existing parser has the virtue of being simple and short. The
> parser I've written is less parsimonious, but should handle everything
> specified by NCBI, as well as some unspecified quirks, and documents
> what GEO SOFT files are expected to contain.

That sounds good, the current GEO parser was very minimal.

> I'm taking a look at Sean
> Davis's GEOquery Bioconductor package for ideas for the interface.

Great - I would have encouraged you to look at Sean's R interface
for ideas. For reference, our existing GEO test files are here:
https://github.com/biopython/biopython/tree/master/Tests/Geo

> There is a class for each GEO record type: GSM, GPL, GSE, and GDS.
> After instantiating each of these, you can call the parse method on
> the resulting object to parse the file, e.g.:
>
>>>> from Bio import Geo
>>>> gds858 = Geo.GDS()
>>>> gds858.parse('GDS858_full.soft')

We may want to use read rather than parse for consistency with the
other newish parsers in Biopython, where parse gives an iterator while
read gives a single object.
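To illustrate that convention, here is a toy sketch of read built on top
of parse; the "record" format here is invented for illustration, not the
actual Geo code:

```python
# Toy sketch of the Biopython read/parse convention: parse is an
# iterator over records, read returns exactly one record or raises.
# A "record" here is just the text of a '^'-prefixed entity line;
# the real Geo classes would return proper objects.

def parse(handle):
    """Iterate over the records in the handle, one per '^' entity line."""
    for line in handle:
        if line.startswith("^"):
            yield line[1:].strip()

def read(handle):
    """Return the single record in the handle, or raise ValueError."""
    records = parse(handle)
    try:
        record = next(records)
    except StopIteration:
        raise ValueError("No records found in handle")
    if next(records, None) is not None:
        raise ValueError("More than one record found in handle")
    return record
```

With that split, Geo.read(handle) would fail loudly on a file holding
several records, while Geo.parse(handle) would let you loop over them.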

>
> Each object has a dictionary named 'meta' that contains the file's metadata:
>
>>>> gds858.meta['channel_count']
> 1
>
> Each attribute has a hook to hang a function to perform additional
> parsing of a value, but most values are stored as strings.
>
> There is also a parseMeta() method if you just need the file's
> metadata (the entity attributes and data table column descriptions)
> and not the data table.
>
> There is also a rudimentary __str__ method to print the metadata.
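The per-attribute hook idea might look something like this - the hook
table and attribute names below are guesses for illustration, not the
actual code:

```python
# Sketch of per-attribute parsing hooks for the meta dictionary:
# attributes with a registered hook get converted on storage, the
# rest stay as plain strings.  The hook table here is illustrative.

META_HOOKS = {
    "channel_count": int,
    "sample_count": int,
}

def store_meta(meta, key, raw_value):
    """Store one metadata value, applying the attribute's hook if any."""
    meta[key] = META_HOOKS.get(key, str)(raw_value)

meta = {}
store_meta(meta, "channel_count", "1")   # stored as the int 1
store_meta(meta, "title", "some title")  # no hook, stays a string
```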
>
> For files that can have data tables (GSM, GPL, and GDS files), there
> is currently just one method for accessing values: getTableValue()
> that takes an ID and a column name and returns the associated value:
>
>>>> gds858.getTableValue('1007_s_at', 'GSM14498')
> 3736.9000000000001
>
> but I will implement other methods to provide more convenient access
> to the data table.
>
> Right now, the data table is just a 2D array and can be accessed like
> any 2D array:
>
> gds858.table[0][2]
> '3736.900'
>
> There are dictionaries for converting between IDs and column names and
> rows and columns:
>
>>>> gds858.idDict['1007_s_at']
> 0
>
>>>> gds858.columnDict['GSM14498']
> 2
>
> It is possible that the underlying representation of the data table
> could change though.
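Those two dictionaries are enough to back getTableValue() - a
self-contained sketch with invented data (the attribute names mirror the
message, the real classes will hold more state):

```python
# Sketch of getTableValue() on top of the 2D table plus the two lookup
# dictionaries described above.  Attribute names mirror the message
# (table, idDict, columnDict); the data values are invented.

class DataTable:
    def __init__(self, ids, columns, rows):
        self.table = rows  # 2D list of strings, one row per ID
        self.idDict = {id_: i for i, id_ in enumerate(ids)}
        self.columnDict = {name: j for j, name in enumerate(columns)}

    def getTableValue(self, row_id, column_name):
        """Look up one cell by row ID and column name, as a float."""
        i = self.idDict[row_id]
        j = self.columnDict[column_name]
        return float(self.table[i][j])

gds = DataTable(
    ids=["1007_s_at"],
    columns=["ID_REF", "IDENTIFIER", "GSM14498"],
    rows=[["1007_s_at", "DDR1", "3736.900"]],
)
# gds.getTableValue("1007_s_at", "GSM14498") returns 3736.9
```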

One possibility is a full load versus iterate over the rows approach.
The latter would be useful if you only wanted some of the data (e.g.
particular genes), and didn't have enough RAM to load it all in full.
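A lazy row iterator along those lines might look like this, assuming the
data table is bracketed by the !dataset_table_begin and
!dataset_table_end markers used in GDS full SOFT files:

```python
# Sketch of the "iterate over rows" approach: yield one data-table row
# at a time so the whole table never has to sit in memory.  Assumes the
# table is bracketed by !dataset_table_begin / !dataset_table_end
# markers, as in GDS full SOFT files.

def iter_table_rows(handle):
    """Lazily yield each tab-separated table row as a list of strings."""
    in_table = False
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith("!dataset_table_begin"):
            in_table = True
        elif line.startswith("!dataset_table_end"):
            break
        elif in_table:
            yield line.split("\t")
```

A caller could then keep only the rows whose probe ID is in some set of
genes of interest, discarding the rest as they stream past.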

> On my dual-core laptop with 4GB of RAM and a 7200RPM hard drive,
> parsing single files is more than fast enough, but I haven't
> benchmarked it or looked at RAM consumption. If it's a problem for
> computers with less RAM or use cases that require having a lot of GEO
> SOFT objects in memory, I can take a look at changing the data table
> representation.
>
> If this parser is incorporated in Biopython, I'm happy to maintain it.

Excellent :)

> The code is well-commented, but I still need to write the
> documentation. I've tested it on a few files of each type, but I still
> need to write unit tests. Since SOFT files can be fairly large - a
> few MB gzipped, tens of MB unzipped - it seems undesirable to package
> them with the biopython source code.

We have a selection of small samples already in the repository
under Tests/GEO - so at the very least you can write unit tests using
them.

Also, for online tests, it would be nice to try Entrez with the
new GEO parser (IIRC, our old parser didn't work nicely with
some of the live data).

> I could make the unit test optional
> and have interested users supply their own files and/or have the test
> download files from NCBI and unzip them.
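For the download option, a small helper could fetch and unzip each file
once and cache it locally; the URL layout below (GDSnnn-style buckets on
the NCBI FTP mirror) is an assumption and may need adjusting:

```python
import gzip
import os
import urllib.request

def gds_soft_url(accession):
    """Guess the NCBI URL for a GDS full SOFT file.  The bucket layout
    (e.g. GDS858 under GDSnnn, GDS5362 under GDS5nnn) is an assumption
    about the current FTP mirror."""
    number = accession[3:]
    bucket = "GDS" + number[:-3] + "nnn"
    return ("https://ftp.ncbi.nlm.nih.gov/geo/datasets/"
            "%s/%s/soft/%s_full.soft.gz" % (bucket, accession, accession))

def fetch_gds(accession, dest_dir="."):
    """Download and gunzip a GDS file, skipping if already cached."""
    dest = os.path.join(dest_dir, "%s_full.soft" % accession)
    if not os.path.exists(dest):
        gz_path = dest + ".gz"
        urllib.request.urlretrieve(gds_soft_url(accession), gz_path)
        with gzip.open(gz_path, "rt") as src, open(dest, "w") as out:
            out.write(src.read())
        os.remove(gz_path)
    return dest
```

Caching the unzipped file means the online test only hits NCBI the
first time it runs on a given machine.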

We've touched on the need for "big data" tests which would be more
targeted at Biopython developers than end users (e.g. SeqIO indexing
of large sequence files), but we haven't yet settled on a framework
for this.

Peter



