[Biopython-dev] population genetics, SNP data management and more

Tue Oct 6 21:08:38 UTC 2009

Hi all,
I'm the primary author of a suite of tools called GLU (Genotype Library &
Utilities) that has some features that may be of interest to
BioPython developers.  It is implemented in Python, uses NumPy, SciPy,
PyTables (or h5py) and a few other common Python libraries, has
its performance-critical portions written in C, and is available as open
source under a BSD-like license.

GLU implements a robust set of data management features for large SNP and
general polymorphism data (human/mammalian for now, since we only support
diploid and haploid genotypes).  We regularly use it to manage datasets with
more than 50 billion SNP genotypes (>50k samples & >1M SNPs).  We define our own
on-disk data representations in text, compressed text, and optimized binary
formats, plus support PLINK and about a dozen other common formats.  Our
native binary storage is based on HDF5 and is quite robust and scalable.  As
a point of reference, the International HapMap Phase I-III data is ~13
GB in its text format, 1.3 GB with gzip compression, and 472 MB in GLU's
HDF5-based LBAT format.

GLU includes modules that compute a range of descriptive statistics on
genotype data quality, concordance,
Mendelian consistency, relationship testing, consistency with
Hardy-Weinberg proportions, and more.  In addition, GLU includes modules to
explore population structure, including estimation of admixture coefficients
(like STRUCTURE, but with fixed source populations and frequencies) and
principal components based on genetic correlations (like EIGENSTRAT and its
ilk).  GLU also supports high-throughput association testing between
dichotomous, polytomous, and continuous (Gaussian) variables and genetic
effects (numerous models), covariates, and arbitrary interactions.  Also
supported is the rapid evaluation of pairwise linkage disequilibrium
statistics and an advanced pairwise SNP tagging algorithm.
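
To give a flavor of one of the checks mentioned above, here is a minimal
sketch of a test for consistency with Hardy-Weinberg proportions at a single
biallelic SNP.  This is not GLU code and does not reflect GLU's API; the
function name hwe_chi_square and its arguments are hypothetical, and it uses a
plain Pearson chi-square statistic (1 degree of freedom), whereas a production
tool may well use an exact test instead.

```python
def hwe_chi_square(n_aa, n_ab, n_bb):
    """Pearson chi-square statistic for deviation from Hardy-Weinberg
    proportions, given observed diploid genotype counts AA, AB, BB.

    Illustrative sketch only; not taken from GLU.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes observed")

    # Estimate allele frequencies from the genotype counts.
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1.0 - p

    # Expected genotype counts under Hardy-Weinberg proportions.
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)

    # Sum (O - E)^2 / E, skipping classes with zero expectation
    # (which only occur for monomorphic SNPs).
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
```

A statistic near zero means the observed counts match the expected
proportions; large values (for example, a SNP with no heterozygotes at all)
often point to genotyping problems rather than biology.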

There are many other features in GLU, though it is not yet feature complete
and the documentation is currently a bit of a work in progress.  Feel free
to take a look at: http://code.google.com/p/glu-genetics

From the PopGen wiki, it seems that there is a desire to implement some of
these features within BioPython.  I'm happy to help, contribute code from
GLU where applicable, or at minimum share some of my experiences.

Best regards,
-Kevin Jacobs