[Biopython-dev] PopGen.Stats

Mon Nov 17 13:27:51 UTC 2008

After too much thinking and too much delaying (delaying in two
distinct senses: proposal delay and delaying for more than 1 year
doing the module), here is my proposal on how to proceed.

Remembering a few fundamental points:

1. Statistics is the core of population genetics. Never Bio.PopGen
will be relevant without it.
2. The framework should be future proof.
3. The API should be for general use (ie not only based on the cases
developers know of).
4. It is very difficult to a have a broad view on how an API like this
can be used (uses vary population genetics of cancer with micro
arrays/lots of data to conservation genetics of species with a few
samples and little number of loci).

A waterfall approach to development is not only outdated as it would
be quite counter productive. So I have no bureaucratic design document
to provide.
My proposal is to choose a bunch of statistics and tests that are
representative of what people might use and implement them. During the
implementation, through refactoring a reasonable API should take form.
What statistics should be choosen then? What are representative statistics?

I was able to find a list of classifications to start. This list got
some inspiration from the very good Arlequin manual. Here are the
different dimensions that I found:
1. Intra-Population versus Inter-population statistics. Say expected
heterozygosity versus Fst
2. Marker dependent vs Marker independent. Say Allelic range (for
microsatelites only) versus Fis
3. Data type: haployic, genotypic phase unknown, genotypic phase
known, genoptypic dominant, frequency only. Say for expected
heterozygosity frequencies are enough, for observed heterozygosity
genotypic phase unknown data is necessary.
4. Single locus (e.g. allelic richness, ExpHe, Fst) versus multi-loci
(e.g., number of polimorphic sites, LD or EHH)
5. Temporal/longitudinal vs single point in time. Say temporal-Fst versus Fst.
6. Population versus Landscape. This issue I suggest abandon for now.

So, the idea is to choose a set of statistics that elucidate these
points, with a good subset we will have a feeling on how everything
fits together. We implement them and then iterate until the API "feels
good". A suggestion of statistics:

ExpHz non-temporal, intra, single-locus, marker independent, genotypic
- gametic unk
ObsHz non-temporal, intra, single-locus, independent, genotypic - gametic kn
Fst(CW) non-temporal, inter, single-locus, indep, genotypic - gametic unk
temporal-Fst temporal, intra, single-locus, indep, genotypic - gametic unk
LD(D') non-temporal, intra, multi-locus, indep, haplo/geno
Fk temporal, intra, single-locus, indep, geno
S (polimorphic sites), non-temporal, intra, multi-locus, indep, haplo/geno
Alleic range, nt, intra, single-locus, microsat, haplo/geno
EHH, nt, positional
Tajima D, nt, intra, single-locus, sequence/rflp

There is still the issue of tests (say Hardy-Weinberg deviation), but
that can be thought while the rest is being done.

The good news is that the half of the above is already implemented
(exceptions are allelic range, S, Tajima D, EHH - presented in
increasing order of implementation difficulty).

I propose implementing the remaining (I can do that, unless any other
wants to give it a try) and then iterate the API until there is a
rough agreement). This can be done on GIT (BTW, my username there is
tiagoantao). I propose that ability to influence policy is roughly
proportional with the time spent coding/effort done ;) .

PS - I am assuming a sequence is a single locus in my reasoning. Of
course it can be seen (and sometimes is) as a sequence of loci (SNPs).