[Biopython-dev] GEO library revamp

Erik Clarke erikclarke at gmail.com
Tue May 15 16:44:32 UTC 2012

Hi all,
I saw on the wiki that the Biopython GEO library was in need of some TLC. I agree: a recent attempt to use the parser for a project in our lab was stymied by its lack of flexibility (it seems to be particularly poor at reading GEO datasets, i.e. GDS files, for instance).

In response, we've developed a basic GEO module in Python, loosely based on GEOquery and the existing Geo module. Currently, our module can download and parse all four major GEO record types (GDS, GSE, GSM, and GPL) and provides rudimentary pretty-printed output of the data. It also provides a representation of a GDS file in a form amenable to statistical analysis with SciPy. As a demonstration, I've included a method that finds the enriched genes in a given subset.
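
For anyone unfamiliar with the underlying format: GEO distributes these records as SOFT text files, in which lines starting with "^" open a new entity, lines starting with "!" carry attributes, lines starting with "#" describe table columns, and tab-separated rows sit between "_table_begin" and "_table_end" markers. The following is only a minimal sketch of that idea, not the code in the branch; the function name parse_soft, the dictionary layout, and the sample record (GDS0000 and its values) are all hypothetical and made up for illustration.

```python
from collections import defaultdict


def parse_soft(lines):
    """Parse SOFT-formatted lines into a list of entity dicts (illustrative only)."""
    entities = []
    current = None
    in_table = False
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        if line.startswith("^"):
            # "^TYPE = name" opens a new entity (e.g. DATASET, SUBSET).
            kind, _, name = line[1:].partition("=")
            current = {
                "type": kind.strip(),
                "name": name.strip(),
                "attributes": defaultdict(list),  # keys may repeat
                "columns": {},
                "table": [],
            }
            entities.append(current)
            in_table = False
        elif current is None:
            continue  # ignore anything before the first entity line
        elif line.startswith("!"):
            key, _, value = line[1:].partition("=")
            key = key.strip()
            if key.endswith("_table_begin"):
                in_table = True
            elif key.endswith("_table_end"):
                in_table = False
            else:
                current["attributes"][key].append(value.strip())
        elif line.startswith("#"):
            # "#COLUMN = description" documents one table column.
            key, _, value = line[1:].partition("=")
            current["columns"][key.strip()] = value.strip()
        elif in_table:
            current["table"].append(line.split("\t"))
    return entities


# Made-up record for illustration; not real GEO data.
SAMPLE = """\
^DATASET = GDS0000
!dataset_title = Hypothetical example
#ID_REF = Platform reference identifier
#GSM01 = Value for GSM01
!dataset_table_begin
ID_REF\tGSM01
probe_1\t12.5
probe_2\t7.25
!dataset_table_end
^SUBSET = GDS0000_1
!subset_description = control
!subset_sample_id = GSM01
"""

records = parse_soft(SAMPLE.splitlines())
for record in records:
    print(record["type"], record["name"], len(record["table"]))
```

The first table row is the header, so a caller would typically split it off before converting the remaining values to numbers for analysis.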

Since this was an internal project until now, I would appreciate any feedback on usability, bugs, etc. that we may not have caught. It's still under active development as I flesh out some of the missing features (better pretty-printing, bug fixes, complete unit-test coverage, etc.).

In any case, my development branch of Biopython is here: https://github.com/eclarke/biopython/tree/GEOQuery. All of the new code is in the Bio/Geo folder (Records.py will replace Record.py). I've tried to comment it as thoroughly as possible. I have not yet tested it on Python < 2.7, but I plan to do so.

If this is of interest to anybody, I would be more than happy to tweak it as people see fit, and I hope it can one day replace the current GEO parser.

Erik Clarke
The Scripps Research Institute
La Jolla, CA
