[Biopython-dev] Statistics in population genetics module - Part I

Fri Oct 31 02:18:56 UTC 2008

Hi,
Can you please be more specific, especially in terms of:
What statistics do you want to compute?
What type of data?

Obviously these are rather interdependent.

In my experience, the statistic and the data type really dictate how
to proceed. Typically you start with pedigree and data files then add
more files for genetic markers (often chromosome specific) etc. Each
requires a specific format and appropriate links between them. Again
this really depends on what you want to calculate and how you do it.

You will probably find that an object-oriented approach with
individuals, families, populations, models, data types, etc. is
actually helpful, and possibly necessary, depending on what you want
to do. It really helped me with my QTL mapping code, especially the
overall design, because it makes you think about exactly where things
should go, and that was far more important than the actual coding.
While some of it is implicit, separating out some components will be
necessary, especially for getting population-based statistics from
data values recorded on individuals.
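
As a rough illustration of that last point, here is a minimal sketch
in Python. The class, attribute and trait names are hypothetical (they
are not part of Bio.PopGen or of any existing code); the only point is
that values stay recorded on individuals while the population object
owns the summary statistic.

```python
from statistics import mean


class Individual:
    """One individual with trait values recorded on it (hypothetical class)."""

    def __init__(self, ident, traits=None):
        self.ident = ident
        # Mapping of trait name -> recorded value; may be empty.
        self.traits = dict(traits or {})


class Population:
    """A named group of individuals (hypothetical class)."""

    def __init__(self, name, members=None):
        self.name = name
        self.members = list(members or [])

    def trait_mean(self, trait):
        """Mean of a trait over the members that have a recorded value."""
        values = [ind.traits[trait] for ind in self.members
                  if trait in ind.traits]
        if not values:
            raise ValueError("no values recorded for trait %r" % trait)
        return mean(values)


pop = Population("herd_A", [
    Individual("i1", {"weight": 10.0}),
    Individual("i2", {"weight": 14.0}),
    Individual("i3"),  # no weight recorded; skipped by trait_mean
])
herd_mean = pop.trait_mean("weight")
print(herd_mean)
```

Families, models and marker data would need their own components in
the same spirit; how far to split things depends on the statistic.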

Bruce

On Thu, Oct 30, 2008 at 6:58 PM, Tiago Antão <tiagoantao at gmail.com> wrote:
> Hi,
>
> Statistics is the most important part of population genetics modules.
> In fact one could say that statistics were invented FOR population
> genetics (check http://en.wikipedia.org/wiki/Ronald_Fisher ).
> When I started to work on the population genetics module I decided to
> delay the statistics module a bit, in order to get experience with the
> whole biopython project before committing to do the most important
> thing.
> Irrespective of whether it is possible to link scipy or not, now seems
> to be the time to advance, especially considering that Giovanni is
> interested in participating.
> A few points need to be made before suggesting how to put
> statistics in Bio.PopGen:
>
> 1. Whatever design is put in, it should be reasonably future proof: a
> few releases from now, it should not be necessary to break older code.
> That should be avoided as much as possible.
> 2. It goes without saying that the code should be useful to everybody
> doing population genetics and not only the authors of Bio.PopGen: all
> kinds of markers and population structures should be accommodated in
> the future.
> 3. For reasons that I've partially explained on the biopython list, I
> don't think an OO model explicitly based on individuals or populations
> is good (or even necessary).
> 4. Any framework should be more pragmatic than anything else. I would
> envision a typical use case like this
>     a) read data (from a certain data source)
>     b) Do some basic processing (changing individuals or populations,
> converting markers)
>     c) calculate statistics
>     A few comments regarding each of these points:
>     a) data sources, file formats: file formats in population
> genetics exist in large quantities and are essentially completely
> ad-hoc, most made in a very naive way. Good or BAD, that is what there
> is. The most used format (some kind of de facto standard, GenePop) can
> only be used for frequency-based statistics; for all the rest things
> are fragmented (although, if there is no population structure and the
> data are sequences, then standard sequence-based formats can be used,
> but from my experience this is a small minority).
>     b) basic processing: This is the point where an OO model of
> individuals and populations would pay off, but I think it is not the
> "meat of the issue".
>     c) statistics: there are statistics of every type and for every
> taste. If you want to have an idea of what is out there, an
> interesting place to look is the arlequin3 manual:
> http://cmpg.unibe.ch/software/arlequin3/arlequin31.pdf
> (part of the manual is UI description, but especially starting at page
> 89 - the table there is a good overview - there are descriptions of
> the overall panorama).
>
> With time, and after at least 3 failed attempts to think in terms of
> individuals/populations, I started to crystallize around a model
> centered on types of statistics. This model actually ends up having
> implicit models of populations and individuals; they are, in fact,
> there. They are just implicit and not unified: different kinds of
> statistics have different implicit models.
> The model that I would like to propose, centered around statistics,
> will be the subject of my next email (which I will send in the next
> couple of days - still under design and lost sleep). I might split it
> into 2 parts (concepts and suggestions for implementation).