[Biopython-dev] Statistics in population genetics module - Part I

Fri Oct 31 10:03:28 UTC 2008

On Fri, Oct 31, 2008 at 2:18 AM, Bruce Southey <bsouthey at gmail.com> wrote:
> Can you please be more specific especially in terms of:
> What statistics do you want to compute?
> What type of data ?
>
> Obviously these are rather interdependent.

I want a framework that can accommodate all statistics and all types
of data (this will be the subject of my next email). For now I am
personally concerned with F statistics, allelic diversity, expected
heterozygosity and the like, i.e., frequency-based statistics. To put
it another way: marker-independent statistics. A great many studies in
population genetics are actually frequency based. But I don't want a
particular view of the world (mine or anyone else's) to dictate the
end result.
My expectation is that in a few weeks the statistics above will be in
Biopython (they are already implemented in working code), and that
this will not impair the ability to continue in other directions
(marker-dependent statistics, genome-wide statistics).
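
To illustrate what "frequency based" means here, below is a minimal
Python sketch that needs only allele counts at a locus, not the marker
type. The function names are hypothetical and are not the Biopython
API; expected heterozygosity uses the standard definition
He = 1 - sum(p_i^2), and "allelic diversity" is assumed here to mean
the number of distinct alleles observed.

```python
from collections import Counter


def allele_frequencies(alleles):
    """Return {allele: frequency} for a list of allele labels at one locus."""
    if not alleles:
        raise ValueError("need at least one allele")
    counts = Counter(alleles)
    total = sum(counts.values())
    return {allele: count / total for allele, count in counts.items()}


def expected_heterozygosity(freqs):
    """He = 1 - sum(p_i^2), computed from allele frequencies only."""
    return 1.0 - sum(p * p for p in freqs.values())


def allelic_diversity(freqs):
    """Number of distinct alleles (an assumed reading of 'allelic diversity')."""
    return len(freqs)


# Hypothetical sample: the labels could be microsatellite sizes, SNP
# bases or anything else hashable; the statistics never inspect them.
sample = ["A", "A", "B", "B"]
freqs = allele_frequencies(sample)
he = expected_heterozygosity(freqs)
n_alleles = allelic_diversity(freqs)
```

Nothing in the sketch depends on what kind of marker produced the
alleles, which is the sense of "marker-independent" above.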

> In my experience, the statistic and the data type really dictate how
> to proceed. Typically you start with pedigree and data files then add
> more files for genetic markers (often chromosome specific) etc. Each
> requires a specific format and appropriate links between them. Again
> this really depends on what you want to calculate and how you do it.

I think the key point is precisely that diversity of statistics and
data types, and how they drive the whole thing. I have also found that
different people do completely different things: from people working
with humans, with lots of data and money, to people with model species,
to people working on the conservation of endangered species. Some
people have thousands of markers and lots of individuals; others have
10 individuals and 20 markers ("poor-man's" markers like
microsatellites). Not to mention population and landscape genetics
statistics, or hierarchical population structure, or new sequencing
methods and the creative uses that we are starting to see with them.

> You will probably find that object orientated approach with
> individuals, families, populations, models and data type etc. may
> actually be helpful and necessary depending on what you want to do.

I've tried to implement several OO frameworks with these kinds of
relations and they all failed. They failed precisely because of the
immense diversity of statistics, data formats and use cases.
I always ended up trashing everything because of a use case or
statistic that would render the model awkward or useless. It is bad
over-engineering. Correcting things is not bad in itself, but in
Biopython we don't want to break interfaces in every release.
Even if there is a good, future-proof model, it will always either be
a poor fit in some situations or have performance problems
(performance is becoming a more serious issue every day).
I think everyone's first approach is to think: let's do OO with
populations, individuals, and so on. But experience in trying to do
that will lower the expectations of what can be delivered.

> This really helped me with QTL mapping code, especially the overall
> design, because it makes you think exactly where things should go, and
> that was far more important than the actual coding.  While some of it is
> implicit, separating out some components will be necessary especially
> getting population-based statistics for data values recorded on
> individuals.

Getting a correct, future-proof design using concepts like individuals
and populations is above my pay grade. And I believe it is above the
pay grade of 100% of the people that I know in this area. I think there
is no need for it anyway. I will try to write about this in the next
part of my emails.