[BioPython] Adding new database types to EUtils

Sean Davis sdavis2 at mail.nih.gov
Tue Dec 4 13:35:13 UTC 2007


On Dec 4, 2007 6:21 AM, Luca Beltrame <luca.beltrame at unimi.it> wrote:

> On Tuesday 04 December 2007 12:16:36 Peter wrote:
>
> > Once you have downloaded the GEO files, what do you plan to do with
> > them? Biopython's GEO parser is very basic...
>
> It was mostly to check their basic descriptions, to see whether they were
> suitable for inclusion in my current work. As I have a large list of
> accessions, fetching them all at once would reduce the time needed to go
> through them. To be clearer: I only want to download their summaries.
>
> > P.S. If you use R/BioConductor, I would recommend Sean Davis' GEOquery
> > for this sort of thing.
>
> I mostly use it when I need to download data set information and expression
> levels. For this simpler task, I turned to Python first, as GEOquery has some
> performance issues on my machine.
>
> I'll take a look at NCBI's EUtils and see if they support GEO. Thanks for
> the tip.


Thought I would chime in here.  GEOquery definitely does have some
performance issues, some of which I have addressed in the most recent
release.  I have thought about making a Python-based version, but I find R a
much more compelling framework for statistical computing and array-based
analyses, despite such tools as Rpy and NumPy.  Using GEOquery also
requires a bit of understanding of the formats used by GEO, as some of them
are monstrously large.  My goal with GEOquery was to allow full parsing of
even the monstrous files.  However, GEO has recently released a GSEMatrix
format (which GEOquery now handles) that is much faster and easier to parse
(meant specifically for Excel to load), so the largest performance issue,
parsing GSE SOFT files, is now pretty much gone.
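
To illustrate why the GSEMatrix format is comparatively easy to parse, here is
a minimal Python sketch.  It assumes the layout as I understand it:
tab-delimited text, metadata lines starting with "!", and the value table
enclosed between "!series_matrix_table_begin" and "!series_matrix_table_end".
The function name parse_series_matrix and the sample data (including the
accession GSE0000) are made up for illustration and are not part of GEOquery
or Biopython.

```python
import csv
import io


def parse_series_matrix(text):
    """Split series-matrix text into (metadata, header, rows).

    Assumed layout: tab-delimited lines, metadata keys prefixed with "!",
    and the table bracketed by begin/end marker lines.
    """
    metadata = {}
    header = None
    rows = []
    in_table = False
    for fields in csv.reader(io.StringIO(text), delimiter="\t"):
        if not fields or not fields[0]:
            continue  # skip blank lines
        key = fields[0]
        if key == "!series_matrix_table_begin":
            in_table = True
        elif key == "!series_matrix_table_end":
            in_table = False
        elif in_table:
            if header is None:
                header = fields  # first table line holds the column names
            else:
                rows.append(fields)
        elif key.startswith("!"):
            # A key can repeat, so collect all values under it.
            metadata.setdefault(key[1:], []).extend(fields[1:])
    return metadata, header, rows


# Hypothetical sample; real files are far larger.
SAMPLE = (
    '!Series_title\t"Hypothetical example series"\n'
    '!Series_geo_accession\t"GSE0000"\n'
    '\n'
    '!series_matrix_table_begin\n'
    '"ID_REF"\t"GSM1"\t"GSM2"\n'
    '"probe_a"\t1.5\t2.5\n'
    '"probe_b"\t3.0\t4.0\n'
    '!series_matrix_table_end\n'
)

metadata, header, rows = parse_series_matrix(SAMPLE)
print(metadata["Series_title"], header, len(rows))
```

Values come back as strings; converting them to floats (and handling missing
values) is left out to keep the sketch short.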

EUtils support is, as far as I know, pretty limited for GEO.  Data download
is generally best accomplished via ftp.  However, if one wants only
metadata (and not values), then URLs can be constructed against their web
page to get back various formats, including SOFT and, in some cases, XML.
I'm not certain, but I don't think the same functionality is available via
EUtils.
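
As a rough sketch of the URL approach, the Python below builds a
metadata-only request for a single accession.  The base address and the
parameter names (acc, targ, form, view) are assumptions based on GEO's
accession display page and should be checked against GEO's current
documentation; the helper name geo_metadata_url and the accession GSE0000
are hypothetical.

```python
from urllib.parse import urlencode

# Assumed address of GEO's accession display script; verify before use.
GEO_ACC_CGI = "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi"


def geo_metadata_url(accession, form="text", view="brief", targ="self"):
    """Build a URL requesting metadata only for one GEO accession.

    Assumed meanings: form="text" asks for SOFT, form="xml" for XML;
    view="brief" leaves out the data tables; targ="self" restricts the
    result to the named record.
    """
    params = [
        ("acc", accession),
        ("targ", targ),
        ("form", form),
        ("view", view),
    ]
    return GEO_ACC_CGI + "?" + urlencode(params)


# Hypothetical accession list; no request is sent here.
for acc in ["GSE0000", "GSE0001"]:
    print(geo_metadata_url(acc))
```

The resulting URLs can be fetched with urllib.request.urlopen, ideally with a
pause between requests when working through a long list of accessions.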

Obviously, GEOquery is open-source and I continue to develop it if there is
a need (and in response to changes by NCBI), so feedback is appreciated.
Also, if there are improvements on the GEO side that would improve its
utility, the folks at GEO do take comments and suggestions pretty seriously,
so feel free to pass comments on to them (or to me and I will do the same).


Sean


